Introduction

With the growing presence of technology in our society, there is a rapidly increasing demand for hardware that can support heavy computational workloads. One of the most important pieces of computer hardware for computationally intensive tasks is the Graphics Processing Unit (GPU), owing to its ability to handle a wide range of parallel processing tasks. This has made GPUs an invaluable resource for companies pursuing Artificial Intelligence (AI), supercomputing, cryptocurrencies, or computer graphics. Unfortunately, the materials needed to produce GPUs are somewhat scarce, which leads to a small pool of manufacturers that compete intensely with one another.

The purpose of this project is to attempt to predict the price trends of a single semiconductor stock (in this case, that of NVIDIA) based on the performance of its competitors, previous pricing, and volume of shares traded. A variety of statistical learning models will be used, ranging from standard regression techniques to non-linear models such as random forests and k-nearest neighbors.

Loading Packages and Data

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(discrim)
library(ggthemes)
library(kableExtra)
library(yardstick)
library(visdat)
library(scales)
library(glmnet)


tidymodels_prefer()
conflicted::conflicts_prefer(yardstick::rsq)
set.seed(3435)

This dataset comprises daily trading data (roughly 252 business days per year) from U.S. stock exchanges for twelve of the most popular semiconductor manufacturers: Advanced Micro Devices (AMD), Applied Materials Inc. (AMAT), ASML Holding N.V. (ASML), Broadcom Inc. (AVGO), Intel Corporation (INTC), Monolithic Power Systems Inc. (MPWR), Nvidia Corp. (NVDA), NXP Semiconductors NV (NXPI), On Semiconductor Corp. (ON), Qualcomm Inc. (QCOM), Taiwan Semiconductor Manufacturing Co. Ltd. (TSM), and Texas Instruments Inc. (TXN).

All stock market data was obtained from Yahoo Finance. Each company’s one-year historical stock data was individually pulled from Yahoo’s historical data on April 12th, 2024. For example, AMD’s stock prices were obtained by downloading the CSV file from AMD’s Historical Data page, which results in a dataframe with the following variables and entries:

read.csv("data/AMD.csv") %>%
  head() %>%
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
Date Open High Low Close Adj.Close Volume
2023-04-13 92.79 93.16 91.83 92.09 92.09 40572500
2023-04-14 91.82 92.97 90.50 91.75 91.75 38734800
2023-04-17 90.23 90.69 88.30 89.87 89.87 47250800
2023-04-18 91.61 92.16 89.33 89.78 89.78 46246300
2023-04-19 88.51 90.54 88.22 89.94 89.94 37344500
2023-04-20 88.83 91.58 88.73 90.11 90.11 47082700

However, since one goal of this analysis is to test the effect of competitors’ stock performance on a fixed GPU manufacturer’s stock price, the data from multiple CSV files must be combined. The easiest way to do this was to create a single CSV file, with header columns renamed both to resolve variable name conflicts and to identify which stock each column belongs to. This was done by prepending the stock’s symbol (e.g. AMD, INTC) to the original variable name:

# Read the data into a dataframe variable 'SSD'
SSD <- read.csv("data/semiconductor_stock_data_mod.csv")
SSD$Date <- as.Date(SSD$Date, format="%m/%d/%y")

SSD %>%
  head() %>%
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
Date NVDA_Open NVDA_High NVDA_Low NVDA_Close NVDA_Adj_Close NVDA_Volume TSM_Open TSM_High TSM_Low TSM_Close TSM_Adj_Close TSM_Volume NXPI_Open NXPI_High NXPI_Low NXPI_Close NXPI_Adj_Close NXPI_Volume QCOM_Open QCOM_High QCOM_Low QCOM_Close QCOM_Adj_Close QCOM_Volume MPWR_Open MPWR_High MPWR_Low MPWR_Close MPWR_Adj_Close MPWR_Volume ON_Open ON_High ON_Low ON_Close ON_Adj_Close ON_Volume AMD_Open AMD_High AMD_Low AMD_Close AMD_Adj_Close AMD_Volume INTC_Open INTC_High INTC_Low INTC_Close INTC_Adj_Close INTC_Volume AVGO_Open AVGO_High AVGO_Low AVGO_Close AVGO_Adj_Close AVGO_Volume ASML_Open ASML_High ASML_Low ASML_Close ASML_Adj_Close ASML_Volume AMAT_Open AMAT_High AMAT_Low AMAT_Close AMAT_Adj_Close AMAT_Volume TXN_Open TXN_High TXN_Low TXN_Close TXN_Adj_Close TXN_Volume
2021-04-12 142.8975 153.5250 141.3925 152.0900 151.7933 86932400 122.21 122.46 119.24 120.90 114.2340 9868400 207.98 208.22 204.87 207.94 197.1795 1686200 138.86 139.89 136.05 137.44 128.6173 10355500 375.41 377.50 367.59 373.92 366.0652 283100 42.42 42.70 41.66 42.40 42.40 4151300 82.06 82.18 78.03 78.58 78.58 62098800 68.20 68.49 64.71 65.41 59.86862 51266900 481.93 485.42 478.60 483.67 446.5351 2324700 631.56 631.58 620.62 630.43 610.5892 739500 137.82 138.73 134.51 135.00 131.6226 11147500 192.64 194.72 191.34 192.43 175.6870 4504800
2021-04-13 152.3150 157.0000 151.2575 156.7950 156.4891 67621200 122.40 122.90 120.35 121.27 114.5836 8384800 207.32 207.79 200.72 202.00 191.5468 3033300 138.38 138.77 135.75 137.30 128.4863 9225200 374.97 379.23 371.30 376.52 368.6105 192700 42.54 42.85 41.55 42.04 42.04 3625300 79.67 80.72 78.98 80.19 80.19 37767300 65.61 65.63 64.21 65.22 59.69471 26822000 485.00 488.22 480.29 484.96 447.7260 1528100 635.63 636.74 623.72 629.12 609.3205 710500 136.63 136.99 133.20 135.10 131.7201 8034300 192.14 193.00 189.76 191.24 174.6006 4009300
2021-04-14 156.2500 157.2050 152.2750 152.7700 152.4719 38550000 121.99 122.43 120.50 120.84 114.1773 9521900 201.20 203.33 198.59 199.89 189.5460 2879000 137.08 137.84 133.91 134.75 126.1000 9967800 375.00 383.00 369.10 369.92 362.1492 251900 41.59 43.17 41.44 42.10 42.10 3889800 79.88 80.13 77.94 78.55 78.55 34263800 65.31 65.38 63.84 64.19 58.75197 25768400 482.47 489.19 475.19 477.30 440.6541 1822000 635.67 641.09 627.04 630.99 611.1315 718000 134.67 137.14 133.24 134.14 130.7841 8134200 190.46 191.50 189.01 190.33 173.7697 3555000
2021-04-15 156.6250 162.1425 156.3150 161.3725 161.0576 59848000 121.70 122.00 116.56 118.35 111.8246 18709100 202.76 202.76 198.53 201.77 191.3288 1858100 136.00 137.99 135.57 137.84 128.9916 11733200 374.64 383.64 374.32 381.57 373.5545 237000 42.44 42.90 41.94 42.63 42.63 3995000 80.32 83.95 79.97 83.01 83.01 68942800 63.97 65.22 63.68 65.02 59.51164 24927700 481.64 482.31 476.78 480.00 443.1469 1837000 633.78 642.90 627.52 642.09 621.8823 980400 136.00 136.14 132.85 134.41 131.0473 8269400 191.93 193.53 190.82 193.17 176.3626 4471900
2021-04-16 160.5300 161.6575 158.6525 159.1250 158.8145 33520800 119.19 120.60 117.85 118.84 112.2876 9512100 201.19 202.08 199.01 199.38 189.0624 2187600 137.62 139.01 136.65 138.21 129.3378 6583900 382.12 386.52 375.38 378.62 370.6664 322000 42.63 42.77 42.07 42.18 42.18 4630300 83.30 83.59 81.53 82.15 82.15 47280600 65.33 65.52 64.57 64.75 59.26453 24625500 480.48 481.78 476.78 478.79 442.0297 1626300 640.26 647.94 638.48 645.69 625.3690 605200 133.50 134.74 133.01 133.73 130.3843 7686300 193.66 194.78 191.64 191.93 175.2305 5792900
2021-04-19 155.3650 158.0750 152.3300 153.6175 153.3178 40442000 118.00 118.88 115.20 115.40 109.0372 12630300 199.37 199.56 191.99 194.76 184.6815 2686000 136.90 137.05 134.09 135.25 126.5678 8728900 377.03 379.21 362.76 369.60 361.8359 238900 42.00 42.53 40.58 41.07 41.07 4554600 82.13 83.18 80.39 81.11 81.11 39115500 64.70 64.74 63.07 63.63 58.23941 23997700 476.53 476.80 460.05 462.00 426.5288 2631900 637.64 639.26 622.45 630.11 610.2793 1138900 133.39 135.28 128.70 130.89 127.6153 12826400 190.35 191.10 186.72 187.06 170.7843 5334900

This results in a seemingly large initial data frame with 73 columns and 757 rows (one row per trading day; since the stock market is open roughly 252 days per year, the combined file spans about three years). Of the 73 columns, 1 is the date variable (which is converted to a Date data type using as.Date() ), and the remaining 72 are the 6 predictors for each of the 12 chosen semiconductor manufacturers.

dim(SSD)
## [1] 757  73
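The renaming and merging step described above could be sketched as follows. This is a minimal illustration rather than the exact procedure used to build semiconductor_stock_data_mod.csv: the helper name prefix_and_merge is hypothetical, and the two small in-memory data frames stand in for the per-ticker CSV files (the second one holds placeholder values, not real prices).

```r
# Hypothetical helper: prefix every non-Date column with its ticker symbol,
# then join all tables on the shared Date column.
prefix_and_merge <- function(stock_list) {
  prefixed <- lapply(names(stock_list), function(ticker) {
    df <- stock_list[[ticker]]
    value_cols <- names(df) != "Date"
    names(df)[value_cols] <- paste(ticker, names(df)[value_cols], sep = "_")
    df
  })
  # Inner join on Date keeps only the days present for every ticker
  Reduce(function(a, b) merge(a, b, by = "Date"), prefixed)
}

# Stand-in for read.csv("data/AMD.csv"): two rows copied from the AMD table
amd <- data.frame(Date = as.Date(c("2023-04-13", "2023-04-14")),
                  Open = c(92.79, 91.82),
                  Volume = c(40572500, 38734800))
# Stand-in for a second ticker file (placeholder values, not real prices)
intc <- data.frame(Date = as.Date(c("2023-04-13", "2023-04-14")),
                   Open = c(1, 2),
                   Volume = c(10, 20))

combined <- prefix_and_merge(list(AMD = amd, INTC = intc))
print(names(combined))
```

Joining on Date with an inner join silently drops any day missing from one of the files, so comparing the row count before and after the merge is a cheap sanity check.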

Fortunately, a quick analysis shows that there is no missing data among any of the CSV files downloaded. This is somewhat expected though, since stock market data is meant to be as publicly available as possible and the original features are fairly common metrics for financial institutions to collect.

vis_miss(SSD)
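The visual check from vis_miss() can be complemented by a numeric count. The sketch below uses only base R; the helper name count_missing and the toy data frame (with one NA inserted on purpose) are assumptions for illustration, and the same call could be applied to SSD.

```r
# Hypothetical helper: number of missing values in each column
count_missing <- function(df) {
  colSums(is.na(df))
}

# Toy data with one deliberately missing closing price
toy <- data.frame(
  Date       = as.Date(c("2023-04-13", "2023-04-14", "2023-04-17")),
  AMD_Close  = c(92.09, NA, 89.87),
  AMD_Volume = c(40572500, 38734800, 47250800)
)

missing_per_column <- count_missing(toy)
total_missing <- sum(missing_per_column)
print(missing_per_column)
```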

Exploratory Data Analysis

We examine the data in terms of the predictors that are given to us, and then see if there are any other possible metrics by which to analyze our stock prices. First, we explain the relevance of each predictor in the initial data frame, though not all variables will be used in our predictive models because some are highly correlated (for example, the previous day’s closing price is heavily correlated with the current day’s opening price). Next, we examine other possible ways to predict a stock’s behavior by looking at both historical metrics and normalized metrics.

Describing the Predictors

xx_Open

For each of the 12 semiconductor manufacturers chosen (AMAT, AMD, ASML, AVGO, INTC, MPWR, NVDA, NXPI, ON, QCOM, TSM, and TXN), there is a variable called xx_Open (where xx is one of the above stock symbols) which corresponds to that stock’s opening price for the day. As the regular trading session on U.S. exchanges runs from 9:30 AM to 4:00 PM Eastern Time, this is the stock’s price at 9:30 AM that day.

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_Open, color = 'AMD')) +
  geom_line(aes(y = NXPI_Open, color = 'NXPI')) + 
  geom_line(aes(y = TXN_Open, color = 'TXN')) +
  geom_line(aes(y = AMAT_Open, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("Opening Price") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_Open, color = 'NVDA')) +
  geom_line(aes(y = MPWR_Open, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_Open, color = 'AVGO')) +
  geom_line(aes(y = ASML_Open, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("Opening Price") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_Open, color = 'ON')) +
  geom_line(aes(y = QCOM_Open, color = 'QCOM')) + 
  geom_line(aes(y = INTC_Open, color = 'INTC')) +
  geom_line(aes(y = TSM_Open, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("Opening Price") +
  theme_dark()

xx_Close

Similar to the variable xx_Open, the predictor xx_Close simply represents the manufacturer’s stock price at the closing time (4:00 PM) of the exchange that day. Over the full period, the trends in the opening and closing prices are almost indistinguishable:

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_Close, color = 'AMD')) +
  geom_line(aes(y = NXPI_Close, color = 'NXPI')) + 
  geom_line(aes(y = TXN_Close, color = 'TXN')) +
  geom_line(aes(y = AMAT_Close, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("Closing Price") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_Close, color = 'NVDA')) +
  geom_line(aes(y = MPWR_Close, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_Close, color = 'AVGO')) +
  geom_line(aes(y = ASML_Close, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("Closing Price") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_Close, color = 'ON')) +
  geom_line(aes(y = QCOM_Close, color = 'QCOM')) + 
  geom_line(aes(y = INTC_Close, color = 'INTC')) +
  geom_line(aes(y = TSM_Close, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("Closing Price") +
  theme_dark()

There are ultimately a few noticeable differences close to the extrema (maxima and minima) of several of the stocks, but this simply reflects the fact that stock prices become volatile after an extended period of growth (i.e. a ‘bubble’) or decline.

xx_Adj_Close

The variables of the format xx_Adj_Close represent the ‘Adjusted closing prices’ of the respective stocks; though closely related to the closing price, the adjusted closing price takes into account any corporate actions that stock may have undergone that day. For example, this accounts for stock splits, dividends, and rights offerings. Those with a deeper financial knowledge are sometimes able to leverage the difference between a stock’s closing price and adjusted closing price to establish a metric on a company’s profitability — however, no such techniques will be used in this analysis.

It should also be noted that neither the adjusted closing price nor the regular closing price is necessarily equal to the opening price the next day; this simply reflects the fact that the public’s valuation of a given stock keeps changing even outside the stock exchange’s usual hours.
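The gap between one day’s close and the next day’s open can be computed directly. The sketch below uses the six AMD rows from the table shown earlier; the helper name overnight_gap is hypothetical and is not part of the analysis that follows.

```r
# First six trading days of AMD, copied from the AMD table above
amd <- data.frame(
  Date  = as.Date(c("2023-04-13", "2023-04-14", "2023-04-17",
                    "2023-04-18", "2023-04-19", "2023-04-20")),
  Open  = c(92.79, 91.82, 90.23, 91.61, 88.51, 88.83),
  Close = c(92.09, 91.75, 89.87, 89.78, 89.94, 90.11)
)

# Hypothetical helper: today's open minus the previous day's close.
# The first day has no previous close, so its gap is NA.
overnight_gap <- function(open, close) {
  c(NA, open[-1] - close[-length(close)])
}

amd$Gap <- overnight_gap(amd$Open, amd$Close)
print(amd)
```

Even in this short sample the gap is nonzero on every day and changes sign, which is why the previous close and the next open are treated as highly correlated but not identical.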

xx_High and xx_Low

The variables xx_High and xx_Low represent the maximum and minimum values, respectively, that the stock reached on that particular day. Since a continuous record of each stock’s value is not readily available, taking the difference of these two values (i.e. the stock’s movement over a day) is one possible way of gauging how volatile a certain stock is over a period of time.

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = (AMD_High - AMD_Low), color = 'AMD')) +
  geom_line(aes(y = (NXPI_High - NXPI_Low), color = 'NXPI')) + 
  geom_line(aes(y = (TXN_High - TXN_Low), color = 'TXN')) +
  geom_line(aes(y = (AMAT_High - AMAT_Low), color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("High - Low") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = (NVDA_High - NVDA_Low), color = 'NVDA')) +
  geom_line(aes(y = (MPWR_High - MPWR_Low), color = 'MPWR')) + 
  geom_line(aes(y = (AVGO_High - AVGO_Low), color = 'AVGO')) +
  geom_line(aes(y = (ASML_High - ASML_Low), color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("High - Low") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = (ON_High - ON_Low), color = 'ON')) +
  geom_line(aes(y = (QCOM_High - QCOM_Low), color = 'QCOM')) + 
  geom_line(aes(y = (INTC_High - INTC_Low), color = 'INTC')) +
  geom_line(aes(y = (TSM_High - TSM_Low), color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("High - Low") +
  theme_dark()

It should be noted, however, that a stock’s movement (i.e. High minus Low) is not always the best way to directly compare two stocks, since their average prices can differ drastically. For example, AVGO attains values over 1000 USD per share while Intel Corporation (INTC) regularly holds its share price just under 50 USD. Thus, if both stocks fluctuate over a given day by 1% of their value, the movement of AVGO will appear significantly more drastic than that of INTC, because AVGO’s shares are worth about 20 times as much as INTC’s. Ultimately this will not matter in the model fitting stage, since all numeric variables will be rescaled during recipe creation.
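The scale effect can be made concrete with the first row of the combined table shown earlier (2021-04-12). Dividing the daily range by the closing price is an assumption made here purely for illustration; the analysis itself relies on the rescaling in the recipe rather than on this ratio.

```r
# High, Low, and Close for 2021-04-12, copied from the first row of the
# combined table above
day1 <- data.frame(
  ticker = c("AVGO", "INTC"),
  High   = c(485.42, 68.49),
  Low    = c(478.60, 64.71),
  Close  = c(483.67, 65.41)
)

# Absolute movement in USD, and movement as a fraction of the closing price
day1$abs_range <- day1$High - day1$Low
day1$rel_range <- day1$abs_range / day1$Close
print(day1)
```

On this day AVGO has the larger range in dollars, while INTC has the larger range relative to its price, so the two measures rank the stocks in opposite order.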

xx_Volume

Lastly, the variables ending with Volume indicate the number of shares that are traded (i.e. change hands between a buyer and a seller) on that given day. As the only predictor in our data set not measured in units of currency, volume gives useful insight into a company’s popularity and thus potential future trends for that stock.

# Trading volume of each stock on row 1 of SSD
tickers <- c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR",
             "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN")
data.frame(name = tickers,
           vols = unlist(SSD[1, paste0(tickers, "_Volume")], use.names = FALSE)) %>%
  ggplot(aes(x = name, y = vols)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(limits = c(0, 80000000), labels = label_comma()) +
  ylab('') +
  xlab('') +
  # The date is read from the data so the title always matches the plotted row
  ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[1], "%m/%d/%Y")))

# Trading volume of each stock on row 126 of SSD
tickers <- c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR",
             "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN")
data.frame(name = tickers,
           vols = unlist(SSD[126, paste0(tickers, "_Volume")], use.names = FALSE)) %>%
  ggplot(aes(x = name, y = vols)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(limits = c(0, 80000000), labels = label_comma()) +
  ylab('') +
  xlab('') +
  # The date is read from the data so the title always matches the plotted row
  ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[126], "%m/%d/%Y")))

# Trading volume of each stock on row 252 of SSD
tickers <- c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR",
             "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN")
data.frame(name = tickers,
           vols = unlist(SSD[252, paste0(tickers, "_Volume")], use.names = FALSE)) %>%
  ggplot(aes(x = name, y = vols)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = label_comma()) +
  ylab('') +
  xlab('') +
  # The date is read from the data so the title always matches the plotted row
  ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[252], "%m/%d/%Y")))

Based on the above plots, one can also see that trading volume among these semiconductor stocks is dominated by three corporations: AMD, Intel (INTC), and NVIDIA (NVDA).

Added Predictors and Metrics

While the six predictors provided by Yahoo Finance give significant insight into each stock’s historical performance, there may be other, more useful metrics that we can use to assess and predict the future growth of our stocks. The main variables we wish to introduce are ones that keep track of data from previous days; since each predictor in our original data frame only describes a single trading day, there could be important information in the long-term trends of a stock which ultimately affects a share’s price.

n-day Average of Closing Price

As investors use historical stock data to decide whether a certain stock is worth buying, it becomes apparent that a stock’s price is, in one way or another, dependent on its previous values. While a price that moves continuously will always be close to its most recent value, even long-term data can affect a stock’s current value: for example, if a stock has been in a steady downward trend for quite some time, this will negatively affect the perception of potential investors.

While there are multiple financial metrics which account for previous stock prices, this analysis will only look at two basic measurements: the n-day average and the n-day standard deviation (where n is some integer-valued hyperparameter). Although there are subtle differences between the opening price and the closing price of a stock, the larger the value of n (in our n-day average), the less it should matter which of the two is averaged; for consistency, we will simply base our new metrics on the closing prices of each stock.

Additionally, there is no clear choice for how much previous data to account for: should the analysis look back at a single week’s worth of data or a month’s? As this is itself an interesting question for the sake of tuning our models, we will treat the window length as an added hyperparameter and consider four possible values: 1 week, 2 weeks, 1 month, and 2 months (taken here to be 5, 10, 20, and 40 trading days, respectively).

running_average <- function(my_vec, num_days) {
  #' Takes the running average of a column vector
  #'
  #' Creates a new column vector whose i-th entry is the average of the most recent num_days
  #' entries of my_vec, up to and including entry i. When fewer than num_days entries are
  #' available, the average of all entries so far is taken instead (for example, if
  #' num_days = 10, then the 2nd entry of the output vector is the average of the first two
  #' values, the 3rd entry is the average of the first three values, and so forth).
  #'
  #' @param my_vec the column vector to take the average values of
  #' @param num_days the number of days one wishes to average over
  #' 
  #' @return A vector whose entries are the running averages of my_vec over windows of num_days entries
  
  
  # Error handling
  if(is.vector(my_vec) == FALSE){
    stop("Not Vector: First argument of running_average must be a vector")
  }
  if(is.numeric(my_vec[1]) == FALSE){
    stop("Non-numeric Entries: values of vector in first argument must be numeric.")
  }
  if(is.numeric(num_days) == FALSE || num_days != round(num_days)){
    stop("Not Integer: Second argument of running_average must be an integer larger than or equal to 2")
  }
  if(num_days <= 1){
    stop("Not Large Enough: Second argument of running_average must be an integer larger than or equal to 2")
  }

  # dummy variable to keep track of sums
  sum_counter = 0
  # return variable
  output_vec = c()
  for (i in seq_along(my_vec)) {
    
    # If there are fewer than num_days of data up to the current date,
    # simply take the average of all the days so far to get the closest thing
    # to a running average
    if (i <= num_days){
      sum_counter = sum_counter + my_vec[i]
      output_vec[i] = sum_counter / i
    }
    else {
      # Add the next day to the sum
      sum_counter = sum_counter + my_vec[i]
      # Subtract the entry that has just dropped out of the num_days window
      sum_counter = sum_counter - my_vec[i-num_days]
      output_vec[i] = sum_counter / num_days
    }
  }
  return(output_vec)
}
  
SSD$NVDA_avg_cl_1W <- running_average(SSD$NVDA_Close, 5)
SSD$TSM_avg_cl_1W <- running_average(SSD$TSM_Close, 5)
SSD$NXPI_avg_cl_1W <- running_average(SSD$NXPI_Close, 5)
SSD$QCOM_avg_cl_1W <- running_average(SSD$QCOM_Close, 5)
SSD$MPWR_avg_cl_1W <- running_average(SSD$MPWR_Close, 5)
SSD$ON_avg_cl_1W <- running_average(SSD$ON_Close, 5)
SSD$AMD_avg_cl_1W <- running_average(SSD$AMD_Close, 5)
SSD$INTC_avg_cl_1W <- running_average(SSD$INTC_Close, 5)
SSD$AVGO_avg_cl_1W <- running_average(SSD$AVGO_Close, 5)
SSD$ASML_avg_cl_1W <- running_average(SSD$ASML_Close, 5)
SSD$AMAT_avg_cl_1W <- running_average(SSD$AMAT_Close, 5)
SSD$TXN_avg_cl_1W <- running_average(SSD$TXN_Close, 5)

SSD$NVDA_avg_cl_2W <- running_average(SSD$NVDA_Close, 10)
SSD$TSM_avg_cl_2W <- running_average(SSD$TSM_Close, 10)
SSD$NXPI_avg_cl_2W <- running_average(SSD$NXPI_Close, 10)
SSD$QCOM_avg_cl_2W <- running_average(SSD$QCOM_Close, 10)
SSD$MPWR_avg_cl_2W <- running_average(SSD$MPWR_Close, 10)
SSD$ON_avg_cl_2W <- running_average(SSD$ON_Close, 10)
SSD$AMD_avg_cl_2W <- running_average(SSD$AMD_Close, 10)
SSD$INTC_avg_cl_2W <- running_average(SSD$INTC_Close, 10)
SSD$AVGO_avg_cl_2W <- running_average(SSD$AVGO_Close, 10)
SSD$ASML_avg_cl_2W <- running_average(SSD$ASML_Close, 10)
SSD$AMAT_avg_cl_2W <- running_average(SSD$AMAT_Close, 10)
SSD$TXN_avg_cl_2W <- running_average(SSD$TXN_Close, 10)

SSD$NVDA_avg_cl_1M <- running_average(SSD$NVDA_Close, 20)
SSD$TSM_avg_cl_1M <- running_average(SSD$TSM_Close, 20)
SSD$NXPI_avg_cl_1M <- running_average(SSD$NXPI_Close, 20)
SSD$QCOM_avg_cl_1M <- running_average(SSD$QCOM_Close, 20)
SSD$MPWR_avg_cl_1M <- running_average(SSD$MPWR_Close, 20)
SSD$ON_avg_cl_1M <- running_average(SSD$ON_Close, 20)
SSD$AMD_avg_cl_1M <- running_average(SSD$AMD_Close, 20)
SSD$INTC_avg_cl_1M <- running_average(SSD$INTC_Close, 20)
SSD$AVGO_avg_cl_1M <- running_average(SSD$AVGO_Close, 20)
SSD$ASML_avg_cl_1M <- running_average(SSD$ASML_Close, 20)
SSD$AMAT_avg_cl_1M <- running_average(SSD$AMAT_Close, 20)
SSD$TXN_avg_cl_1M <- running_average(SSD$TXN_Close, 20)

SSD$NVDA_avg_cl_2M <- running_average(SSD$NVDA_Close, 40)
SSD$TSM_avg_cl_2M <- running_average(SSD$TSM_Close, 40)
SSD$NXPI_avg_cl_2M <- running_average(SSD$NXPI_Close, 40)
SSD$QCOM_avg_cl_2M <- running_average(SSD$QCOM_Close, 40)
SSD$MPWR_avg_cl_2M <- running_average(SSD$MPWR_Close, 40)
SSD$ON_avg_cl_2M <- running_average(SSD$ON_Close, 40)
SSD$AMD_avg_cl_2M <- running_average(SSD$AMD_Close, 40)
SSD$INTC_avg_cl_2M <- running_average(SSD$INTC_Close, 40)
SSD$AVGO_avg_cl_2M <- running_average(SSD$AVGO_Close, 40)
SSD$ASML_avg_cl_2M <- running_average(SSD$ASML_Close, 40)
SSD$AMAT_avg_cl_2M <- running_average(SSD$AMAT_Close, 40)
SSD$TXN_avg_cl_2M <- running_average(SSD$TXN_Close, 40)
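
The forty-eight assignments above follow a single naming pattern, so they could also be generated in a loop. The sketch below is one way to do that under stated assumptions: trailing_mean is a hypothetical stand-in for running_average (so that the block runs on its own), and toy is a two-ticker placeholder for SSD with made-up closing prices.

```r
# Hypothetical stand-in for running_average(): mean of the most recent n
# entries, using fewer entries at the start of the series
trailing_mean <- function(x, n) {
  vapply(seq_along(x),
         function(i) mean(x[max(1, i - n + 1):i]),
         numeric(1))
}

# Window labels mapped to lengths in trading days, as in the assignments above
windows <- c("1W" = 5, "2W" = 10, "1M" = 20, "2M" = 40)

# Placeholder data: two tickers with made-up closing prices
toy <- data.frame(AMD_Close  = as.numeric(1:50),
                  INTC_Close = as.numeric(50:1))
tickers <- c("AMD", "INTC")

# One new column per (ticker, window) pair, named like AMD_avg_cl_1W
for (ticker in tickers) {
  for (label in names(windows)) {
    new_col <- paste0(ticker, "_avg_cl_", label)
    close_col <- paste0(ticker, "_Close")
    toy[[new_col]] <- trailing_mean(toy[[close_col]], windows[[label]])
  }
}
print(names(toy))
```

Keeping the window lengths in one named vector means a fifth window could be added by changing a single line.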


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_avg_cl_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_avg_cl_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_avg_cl_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_avg_cl_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("2-Week Average") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_avg_cl_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_avg_cl_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_avg_cl_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_avg_cl_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("2-Week Average") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_avg_cl_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_avg_cl_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_avg_cl_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_avg_cl_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("2-Week Average") +
  theme_dark()

One characteristic that immediately becomes apparent is that plotting the running averages instead of the closing prices seems to “smooth out” the curves. In other words, the running average is much more stable and is less affected by a share’s volatility than the original predictors obtained from the CSV. In effect, as n grows, each value is averaged with more of the series’ history, which pulls the curve away from the raw data and toward the overall average; since the overall average is a constant function (and thus linear), the “smoothing out” can be viewed as gradually interpolating toward a \(C^\infty(\mathbb{R})\) (smooth) function.
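The smoothing effect can also be checked numerically. The sketch below compares the day-to-day variability of a series before and after a trailing average; the simulated random walk and the 10-day window are assumptions chosen purely for illustration and are not the stock data used in this project.

```r
set.seed(1)

# Simulated "price": a random walk of 250 steps (not real stock data)
price <- 100 + cumsum(rnorm(250))

# 10-day trailing average via stats::filter; the first n - 1 entries are NA
# because a full window is not yet available
n <- 10
smoothed <- as.numeric(stats::filter(price, rep(1 / n, n), sides = 1))

# Standard deviation of the daily changes, before and after smoothing
sd_raw      <- sd(diff(price))
sd_smoothed <- sd(diff(smoothed), na.rm = TRUE)
cat("raw:", sd_raw, " smoothed:", sd_smoothed, "\n")
```

Unlike running_average, this version leaves the first n - 1 entries undefined instead of averaging over a shorter window. stats::filter is called with its namespace because dplyr also exports a function named filter.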

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_avg_cl_2M, color = 'AMD')) +
  geom_line(aes(y = NXPI_avg_cl_2M, color = 'NXPI')) + 
  geom_line(aes(y = TXN_avg_cl_2M, color = 'TXN')) +
  geom_line(aes(y = AMAT_avg_cl_2M, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("2-Month Average") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_avg_cl_2M, color = 'NVDA')) +
  geom_line(aes(y = MPWR_avg_cl_2M, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_avg_cl_2M, color = 'AVGO')) +
  geom_line(aes(y = ASML_avg_cl_2M, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("2-Month Average") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_avg_cl_2M, color = 'ON')) +
  geom_line(aes(y = QCOM_avg_cl_2M, color = 'QCOM')) + 
  geom_line(aes(y = INTC_avg_cl_2M, color = 'INTC')) +
  geom_line(aes(y = TSM_avg_cl_2M, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("2-Month Average") +
  theme_dark()

n-Day Standard Deviation of Closing Price

With a concrete notion of the n-day average closing price of a stock, it is natural to measure the n-day standard deviation as well, to gain insight into the volatility of each stock.
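
For reference, the two rolling quantities can be written out explicitly. The notation is introduced here for illustration: \(x_j\) is the closing price on day \(j\), \(n\) is the window length, and the window ending on day \(i\) covers days \(i-n+1\) through \(i\). For \(i < n\) the window is shortened to days \(1\) through \(i\), and \(n\) is replaced by \(i\). The divisor \(n-1\) is the usual sample standard deviation convention.

\[
\bar{x}_i = \frac{1}{n} \sum_{j=i-n+1}^{i} x_j,
\qquad
s_i = \sqrt{\frac{1}{n-1} \sum_{j=i-n+1}^{i} \left( x_j - \bar{x}_i \right)^2}
\]

Every deviation in \(s_i\) is measured from the mean \(\bar{x}_i\) of the current window, not from the running average of the day on which \(x_j\) was observed.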

running_deviation <- function(my_vec, num_days) {
  #' Takes the running standard deviation of a column vector
  #'
  #' Creates a new column vector whose i-th entry is the sample standard deviation of the most
  #' recent num_days entries of my_vec, up to and including entry i. When fewer than num_days
  #' entries are available, the standard deviation of all entries so far is taken instead (for
  #' example, if num_days = 10, then the 2nd entry of the output vector is the standard deviation
  #' of the first two values, the 3rd entry is that of the first three values, and so forth).
  #' The first entry is set to 0, since a single value has no spread.
  #'
  #' @param my_vec the column vector to take the standard deviation of
  #' @param num_days the number of days in each window
  #' 
  #' @return A vector whose entries are the running standard deviations of my_vec over windows of num_days entries
  
  
  # Error handling
  if(is.vector(my_vec) == FALSE){
    stop("Not Vector: First argument of running_deviation must be a vector")
  }
  if(is.numeric(my_vec[1]) == FALSE){
    stop("Non-numeric Entries: values of vector in first argument must be numeric.")
  }
  if(is.numeric(num_days) == FALSE || num_days != round(num_days)){
    stop("Not Integer: Second argument of running_deviation must be an integer larger than or equal to 2")
  }
  if(num_days <= 1){
    stop("Not Large Enough: Second argument of running_deviation must be an integer larger than or equal to 2")
  }
  
  
  # return variable
  output_vec = numeric(length(my_vec))
  
  # A single observation has no spread, so the first entry is set to 0;
  # starting the loop at the second entry also avoids dividing by 0 in the
  # sample standard deviation
  output_vec[1] = 0
  for (i in seq_along(my_vec)[-1]) {
    
    # First index of the window ending at day i. If there are fewer than
    # num_days of data up to the current date, the window simply starts at
    # the first entry
    window_start = max(1, i - num_days + 1)
    
    # sd() measures deviations from the mean of the current window, which is
    # what makes the result a standard deviation of that window
    output_vec[i] = sd(my_vec[window_start:i])
  }
  return(output_vec)
}

SSD$NVDA_std_dev_cl_1W <- running_deviation(SSD$NVDA_Close, 5)
SSD$TSM_std_dev_cl_1W <- running_deviation(SSD$TSM_Close, 5)
SSD$NXPI_std_dev_cl_1W <- running_deviation(SSD$NXPI_Close, 5)
SSD$QCOM_std_dev_cl_1W <- running_deviation(SSD$QCOM_Close, 5)
SSD$MPWR_std_dev_cl_1W <- running_deviation(SSD$MPWR_Close, 5)
SSD$ON_std_dev_cl_1W <- running_deviation(SSD$ON_Close, 5)
SSD$AMD_std_dev_cl_1W <- running_deviation(SSD$AMD_Close, 5)
SSD$INTC_std_dev_cl_1W <- running_deviation(SSD$INTC_Close, 5)
SSD$AVGO_std_dev_cl_1W <- running_deviation(SSD$AVGO_Close, 5)
SSD$ASML_std_dev_cl_1W <- running_deviation(SSD$ASML_Close, 5)
SSD$AMAT_std_dev_cl_1W <- running_deviation(SSD$AMAT_Close, 5)
SSD$TXN_std_dev_cl_1W <- running_deviation(SSD$TXN_Close, 5)

SSD$NVDA_std_dev_cl_2W <- running_deviation(SSD$NVDA_Close, 10)
SSD$TSM_std_dev_cl_2W <- running_deviation(SSD$TSM_Close, 10)
SSD$NXPI_std_dev_cl_2W <- running_deviation(SSD$NXPI_Close, 10)
SSD$QCOM_std_dev_cl_2W <- running_deviation(SSD$QCOM_Close, 10)
SSD$MPWR_std_dev_cl_2W <- running_deviation(SSD$MPWR_Close, 10)
SSD$ON_std_dev_cl_2W <- running_deviation(SSD$ON_Close, 10)
SSD$AMD_std_dev_cl_2W <- running_deviation(SSD$AMD_Close, 10)
SSD$INTC_std_dev_cl_2W <- running_deviation(SSD$INTC_Close, 10)
SSD$AVGO_std_dev_cl_2W <- running_deviation(SSD$AVGO_Close, 10)
SSD$ASML_std_dev_cl_2W <- running_deviation(SSD$ASML_Close, 10)
SSD$AMAT_std_dev_cl_2W <- running_deviation(SSD$AMAT_Close, 10)
SSD$TXN_std_dev_cl_2W <- running_deviation(SSD$TXN_Close, 10)

SSD$NVDA_std_dev_cl_1M <- running_deviation(SSD$NVDA_Close, 20)
SSD$TSM_std_dev_cl_1M <- running_deviation(SSD$TSM_Close, 20)
SSD$NXPI_std_dev_cl_1M <- running_deviation(SSD$NXPI_Close, 20)
SSD$QCOM_std_dev_cl_1M <- running_deviation(SSD$QCOM_Close, 20)
SSD$MPWR_std_dev_cl_1M <- running_deviation(SSD$MPWR_Close, 20)
SSD$ON_std_dev_cl_1M <- running_deviation(SSD$ON_Close, 20)
SSD$AMD_std_dev_cl_1M <- running_deviation(SSD$AMD_Close, 20)
SSD$INTC_std_dev_cl_1M <- running_deviation(SSD$INTC_Close, 20)
SSD$AVGO_std_dev_cl_1M <- running_deviation(SSD$AVGO_Close, 20)
SSD$ASML_std_dev_cl_1M <- running_deviation(SSD$ASML_Close, 20)
SSD$AMAT_std_dev_cl_1M <- running_deviation(SSD$AMAT_Close, 20)
SSD$TXN_std_dev_cl_1M <- running_deviation(SSD$TXN_Close, 20)

SSD$NVDA_std_dev_cl_2M <- running_deviation(SSD$NVDA_Close, 40)
SSD$TSM_std_dev_cl_2M <- running_deviation(SSD$TSM_Close, 40)
SSD$NXPI_std_dev_cl_2M <- running_deviation(SSD$NXPI_Close, 40)
SSD$QCOM_std_dev_cl_2M <- running_deviation(SSD$QCOM_Close, 40)
SSD$MPWR_std_dev_cl_2M <- running_deviation(SSD$MPWR_Close, 40)
SSD$ON_std_dev_cl_2M <- running_deviation(SSD$ON_Close, 40)
SSD$AMD_std_dev_cl_2M <- running_deviation(SSD$AMD_Close, 40)
SSD$INTC_std_dev_cl_2M <- running_deviation(SSD$INTC_Close, 40)
SSD$AVGO_std_dev_cl_2M <- running_deviation(SSD$AVGO_Close, 40)
SSD$ASML_std_dev_cl_2M <- running_deviation(SSD$ASML_Close, 40)
SSD$AMAT_std_dev_cl_2M <- running_deviation(SSD$AMAT_Close, 40)
SSD$TXN_std_dev_cl_2M <- running_deviation(SSD$TXN_Close, 40)


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_std_dev_cl_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_std_dev_cl_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_std_dev_cl_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_std_dev_cl_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("2-Week Standard Deviation") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_std_dev_cl_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_std_dev_cl_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_std_dev_cl_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_std_dev_cl_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("2-Week Standard Deviation") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_std_dev_cl_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_std_dev_cl_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_std_dev_cl_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_std_dev_cl_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("2-Week Standard Deviation") +
  theme_dark()
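
The per-ticker assignments above repeat the same call for every ticker and time window. As a rough sketch of how such columns could be generated in a loop instead, consider the following. Note that `rolling_sd_sketch` is a hypothetical stand-in and not the `running_deviation` function used above: it simply recomputes `sd()` over each trailing window, so its values may differ from the incremental version. The `toy` data frame and its prices are made up for illustration.

```r
# Hypothetical stand-in for running_deviation(): a trailing-window sample
# standard deviation that uses a shorter window at the start of the series.
rolling_sd_sketch <- function(x, num_days) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    window <- x[max(1, i - num_days + 1):i]
    # sd() of a single value is NA, so the first entry is set to 0
    out[i] <- if (length(window) > 1) sd(window) else 0
  }
  out
}

# Toy data standing in for SSD (values are made up)
toy <- data.frame(
  NVDA_Close = c(10, 11, 12, 11, 13, 14, 13, 15),
  AMD_Close  = c(5, 5.5, 5.2, 5.8, 6.0, 5.9, 6.3, 6.1)
)

tickers <- c("NVDA", "AMD")
windows <- c("1W" = 5, "2W" = 10, "1M" = 20, "2M" = 40)

# One new column per (ticker, window) pair, named like the columns above
for (ticker in tickers) {
  for (label in names(windows)) {
    new_col <- paste0(ticker, "_std_dev_cl_", label)
    toy[[new_col]] <- rolling_sd_sketch(toy[[paste0(ticker, "_Close")]],
                                        windows[[label]])
  }
}
```

The same pattern would apply to the other rolling predictors by swapping the source column suffix and the output name.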

n-Day Average Return

One of the main reasons someone invests in a stock is the belief that there is a profit to be made from the company’s performance. On a day-to-day basis, this can be measured by the difference between the closing price and the opening price: if the closing price is higher than the opening price, an investor theoretically increased their net worth that day (and vice versa). Although we are already measuring the average closing price, that average does not account for possible downward trends, since it is simply the mean of a set of prices and can be identical for a rising and a falling stock. The difference between opening and closing price, on the other hand, gives some short-term insight into the trend of a stock on a daily basis. Note that the code below computes this difference using the adjusted closing price (Adj_Close) rather than the raw closing price.
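
To make the distinction concrete, here is a minimal sketch with made-up prices (the vectors below are hypothetical and not taken from the dataset). It shows two stocks with the same average closing price over five days but opposite average daily returns.

```r
# Made-up open and close prices for a falling and a rising stock
falling_open  <- c(110, 108, 106, 104, 102)
falling_close <- c(108, 106, 104, 102, 100)
rising_open   <- c(98, 100, 102, 104, 106)
rising_close  <- c(100, 102, 104, 106, 108)

# Daily return: closing price minus opening price
falling_return <- falling_close - falling_open
rising_return  <- rising_close - rising_open

# The average closing prices agree, but the average returns have opposite signs
mean(falling_close)   # 104
mean(rising_close)    # 104
mean(falling_return)  # -2
mean(rising_return)   # 2
```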

SSD$NVDA_Return <- (SSD$NVDA_Adj_Close - SSD$NVDA_Open)
SSD$TSM_Return <- (SSD$TSM_Adj_Close - SSD$TSM_Open)
SSD$NXPI_Return <- (SSD$NXPI_Adj_Close - SSD$NXPI_Open)
SSD$QCOM_Return <- (SSD$QCOM_Adj_Close - SSD$QCOM_Open)
SSD$MPWR_Return <- (SSD$MPWR_Adj_Close - SSD$MPWR_Open)
SSD$ON_Return <- (SSD$ON_Adj_Close - SSD$ON_Open)
SSD$AMD_Return <- (SSD$AMD_Adj_Close - SSD$AMD_Open)
SSD$INTC_Return <- (SSD$INTC_Adj_Close - SSD$INTC_Open)
SSD$AVGO_Return <- (SSD$AVGO_Adj_Close - SSD$AVGO_Open)
SSD$ASML_Return <- (SSD$ASML_Adj_Close - SSD$ASML_Open)
SSD$AMAT_Return <- (SSD$AMAT_Adj_Close - SSD$AMAT_Open)
SSD$TXN_Return <- (SSD$TXN_Adj_Close - SSD$TXN_Open)


SSD$NVDA_avg_ret_1W <- running_average(SSD$NVDA_Return, 5)
SSD$TSM_avg_ret_1W <- running_average(SSD$TSM_Return, 5)
SSD$NXPI_avg_ret_1W <- running_average(SSD$NXPI_Return, 5)
SSD$QCOM_avg_ret_1W <- running_average(SSD$QCOM_Return, 5)
SSD$MPWR_avg_ret_1W <- running_average(SSD$MPWR_Return, 5)
SSD$ON_avg_ret_1W <- running_average(SSD$ON_Return, 5)
SSD$AMD_avg_ret_1W <- running_average(SSD$AMD_Return, 5)
SSD$INTC_avg_ret_1W <- running_average(SSD$INTC_Return, 5)
SSD$AVGO_avg_ret_1W <- running_average(SSD$AVGO_Return, 5)
SSD$ASML_avg_ret_1W <- running_average(SSD$ASML_Return, 5)
SSD$AMAT_avg_ret_1W <- running_average(SSD$AMAT_Return, 5)
SSD$TXN_avg_ret_1W <- running_average(SSD$TXN_Return, 5)

SSD$NVDA_avg_ret_2W <- running_average(SSD$NVDA_Return, 10)
SSD$TSM_avg_ret_2W <- running_average(SSD$TSM_Return, 10)
SSD$NXPI_avg_ret_2W <- running_average(SSD$NXPI_Return, 10)
SSD$QCOM_avg_ret_2W <- running_average(SSD$QCOM_Return, 10)
SSD$MPWR_avg_ret_2W <- running_average(SSD$MPWR_Return, 10)
SSD$ON_avg_ret_2W <- running_average(SSD$ON_Return, 10)
SSD$AMD_avg_ret_2W <- running_average(SSD$AMD_Return, 10)
SSD$INTC_avg_ret_2W <- running_average(SSD$INTC_Return, 10)
SSD$AVGO_avg_ret_2W <- running_average(SSD$AVGO_Return, 10)
SSD$ASML_avg_ret_2W <- running_average(SSD$ASML_Return, 10)
SSD$AMAT_avg_ret_2W <- running_average(SSD$AMAT_Return, 10)
SSD$TXN_avg_ret_2W <- running_average(SSD$TXN_Return, 10)

SSD$NVDA_avg_ret_1M <- running_average(SSD$NVDA_Return, 20)
SSD$TSM_avg_ret_1M <- running_average(SSD$TSM_Return, 20)
SSD$NXPI_avg_ret_1M <- running_average(SSD$NXPI_Return, 20)
SSD$QCOM_avg_ret_1M <- running_average(SSD$QCOM_Return, 20)
SSD$MPWR_avg_ret_1M <- running_average(SSD$MPWR_Return, 20)
SSD$ON_avg_ret_1M <- running_average(SSD$ON_Return, 20)
SSD$AMD_avg_ret_1M <- running_average(SSD$AMD_Return, 20)
SSD$INTC_avg_ret_1M <- running_average(SSD$INTC_Return, 20)
SSD$AVGO_avg_ret_1M <- running_average(SSD$AVGO_Return, 20)
SSD$ASML_avg_ret_1M <- running_average(SSD$ASML_Return, 20)
SSD$AMAT_avg_ret_1M <- running_average(SSD$AMAT_Return, 20)
SSD$TXN_avg_ret_1M <- running_average(SSD$TXN_Return, 20)

SSD$NVDA_avg_ret_2M <- running_average(SSD$NVDA_Return, 40)
SSD$TSM_avg_ret_2M <- running_average(SSD$TSM_Return, 40)
SSD$NXPI_avg_ret_2M <- running_average(SSD$NXPI_Return, 40)
SSD$QCOM_avg_ret_2M <- running_average(SSD$QCOM_Return, 40)
SSD$MPWR_avg_ret_2M <- running_average(SSD$MPWR_Return, 40)
SSD$ON_avg_ret_2M <- running_average(SSD$ON_Return, 40)
SSD$AMD_avg_ret_2M <- running_average(SSD$AMD_Return, 40)
SSD$INTC_avg_ret_2M <- running_average(SSD$INTC_Return, 40)
SSD$AVGO_avg_ret_2M <- running_average(SSD$AVGO_Return, 40)
SSD$ASML_avg_ret_2M <- running_average(SSD$ASML_Return, 40)
SSD$AMAT_avg_ret_2M <- running_average(SSD$AMAT_Return, 40)
SSD$TXN_avg_ret_2M <- running_average(SSD$TXN_Return, 40)


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_avg_ret_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_avg_ret_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_avg_ret_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_avg_ret_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("2-Week Average Return") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_avg_ret_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_avg_ret_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_avg_ret_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_avg_ret_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("2-Week Average Return") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_avg_ret_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_avg_ret_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_avg_ret_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_avg_ret_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("2-Week Average Return") +
  theme_dark()

n-Day Return Standard Deviation

SSD$NVDA_std_dev_ret_1W <- running_deviation(SSD$NVDA_Return, 5)
SSD$TSM_std_dev_ret_1W <- running_deviation(SSD$TSM_Return, 5)
SSD$NXPI_std_dev_ret_1W <- running_deviation(SSD$NXPI_Return, 5)
SSD$QCOM_std_dev_ret_1W <- running_deviation(SSD$QCOM_Return, 5)
SSD$MPWR_std_dev_ret_1W <- running_deviation(SSD$MPWR_Return, 5)
SSD$ON_std_dev_ret_1W <- running_deviation(SSD$ON_Return, 5)
SSD$AMD_std_dev_ret_1W <- running_deviation(SSD$AMD_Return, 5)
SSD$INTC_std_dev_ret_1W <- running_deviation(SSD$INTC_Return, 5)
SSD$AVGO_std_dev_ret_1W <- running_deviation(SSD$AVGO_Return, 5)
SSD$ASML_std_dev_ret_1W <- running_deviation(SSD$ASML_Return, 5)
SSD$AMAT_std_dev_ret_1W <- running_deviation(SSD$AMAT_Return, 5)
SSD$TXN_std_dev_ret_1W <- running_deviation(SSD$TXN_Return, 5)

SSD$NVDA_std_dev_ret_2W <- running_deviation(SSD$NVDA_Return, 10)
SSD$TSM_std_dev_ret_2W <- running_deviation(SSD$TSM_Return, 10)
SSD$NXPI_std_dev_ret_2W <- running_deviation(SSD$NXPI_Return, 10)
SSD$QCOM_std_dev_ret_2W <- running_deviation(SSD$QCOM_Return, 10)
SSD$MPWR_std_dev_ret_2W <- running_deviation(SSD$MPWR_Return, 10)
SSD$ON_std_dev_ret_2W <- running_deviation(SSD$ON_Return, 10)
SSD$AMD_std_dev_ret_2W <- running_deviation(SSD$AMD_Return, 10)
SSD$INTC_std_dev_ret_2W <- running_deviation(SSD$INTC_Return, 10)
SSD$AVGO_std_dev_ret_2W <- running_deviation(SSD$AVGO_Return, 10)
SSD$ASML_std_dev_ret_2W <- running_deviation(SSD$ASML_Return, 10)
SSD$AMAT_std_dev_ret_2W <- running_deviation(SSD$AMAT_Return, 10)
SSD$TXN_std_dev_ret_2W <- running_deviation(SSD$TXN_Return, 10)

SSD$NVDA_std_dev_ret_1M <- running_deviation(SSD$NVDA_Return, 20)
SSD$TSM_std_dev_ret_1M <- running_deviation(SSD$TSM_Return, 20)
SSD$NXPI_std_dev_ret_1M <- running_deviation(SSD$NXPI_Return, 20)
SSD$QCOM_std_dev_ret_1M <- running_deviation(SSD$QCOM_Return, 20)
SSD$MPWR_std_dev_ret_1M <- running_deviation(SSD$MPWR_Return, 20)
SSD$ON_std_dev_ret_1M <- running_deviation(SSD$ON_Return, 20)
SSD$AMD_std_dev_ret_1M <- running_deviation(SSD$AMD_Return, 20)
SSD$INTC_std_dev_ret_1M <- running_deviation(SSD$INTC_Return, 20)
SSD$AVGO_std_dev_ret_1M <- running_deviation(SSD$AVGO_Return, 20)
SSD$ASML_std_dev_ret_1M <- running_deviation(SSD$ASML_Return, 20)
SSD$AMAT_std_dev_ret_1M <- running_deviation(SSD$AMAT_Return, 20)
SSD$TXN_std_dev_ret_1M <- running_deviation(SSD$TXN_Return, 20)

SSD$NVDA_std_dev_ret_2M <- running_deviation(SSD$NVDA_Return, 40)
SSD$TSM_std_dev_ret_2M <- running_deviation(SSD$TSM_Return, 40)
SSD$NXPI_std_dev_ret_2M <- running_deviation(SSD$NXPI_Return, 40)
SSD$QCOM_std_dev_ret_2M <- running_deviation(SSD$QCOM_Return, 40)
SSD$MPWR_std_dev_ret_2M <- running_deviation(SSD$MPWR_Return, 40)
SSD$ON_std_dev_ret_2M <- running_deviation(SSD$ON_Return, 40)
SSD$AMD_std_dev_ret_2M <- running_deviation(SSD$AMD_Return, 40)
SSD$INTC_std_dev_ret_2M <- running_deviation(SSD$INTC_Return, 40)
SSD$AVGO_std_dev_ret_2M <- running_deviation(SSD$AVGO_Return, 40)
SSD$ASML_std_dev_ret_2M <- running_deviation(SSD$ASML_Return, 40)
SSD$AMAT_std_dev_ret_2M <- running_deviation(SSD$AMAT_Return, 40)
SSD$TXN_std_dev_ret_2M <- running_deviation(SSD$TXN_Return, 40)


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_std_dev_ret_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_std_dev_ret_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_std_dev_ret_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_std_dev_ret_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('USD') + 
  ggtitle("2-Week Return Standard Deviation") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_std_dev_ret_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_std_dev_ret_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_std_dev_ret_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_std_dev_ret_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('USD') + 
  ggtitle("2-Week Return Standard Deviation") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_std_dev_ret_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_std_dev_ret_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_std_dev_ret_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_std_dev_ret_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('USD') + 
  ggtitle("2-Week Return Standard Deviation") +
  theme_dark()

n-Day Average Volume

SSD$NVDA_avg_vol_1W <- running_average(SSD$NVDA_Volume, 5)
SSD$TSM_avg_vol_1W <- running_average(SSD$TSM_Volume, 5)
SSD$NXPI_avg_vol_1W <- running_average(SSD$NXPI_Volume, 5)
SSD$QCOM_avg_vol_1W <- running_average(SSD$QCOM_Volume, 5)
SSD$MPWR_avg_vol_1W <- running_average(SSD$MPWR_Volume, 5)
SSD$ON_avg_vol_1W <- running_average(SSD$ON_Volume, 5)
SSD$AMD_avg_vol_1W <- running_average(SSD$AMD_Volume, 5)
SSD$INTC_avg_vol_1W <- running_average(SSD$INTC_Volume, 5)
SSD$AVGO_avg_vol_1W <- running_average(SSD$AVGO_Volume, 5)
SSD$ASML_avg_vol_1W <- running_average(SSD$ASML_Volume, 5)
SSD$AMAT_avg_vol_1W <- running_average(SSD$AMAT_Volume, 5)
SSD$TXN_avg_vol_1W <- running_average(SSD$TXN_Volume, 5)

SSD$NVDA_avg_vol_2W <- running_average(SSD$NVDA_Volume, 10)
SSD$TSM_avg_vol_2W <- running_average(SSD$TSM_Volume, 10)
SSD$NXPI_avg_vol_2W <- running_average(SSD$NXPI_Volume, 10)
SSD$QCOM_avg_vol_2W <- running_average(SSD$QCOM_Volume, 10)
SSD$MPWR_avg_vol_2W <- running_average(SSD$MPWR_Volume, 10)
SSD$ON_avg_vol_2W <- running_average(SSD$ON_Volume, 10)
SSD$AMD_avg_vol_2W <- running_average(SSD$AMD_Volume, 10)
SSD$INTC_avg_vol_2W <- running_average(SSD$INTC_Volume, 10)
SSD$AVGO_avg_vol_2W <- running_average(SSD$AVGO_Volume, 10)
SSD$ASML_avg_vol_2W <- running_average(SSD$ASML_Volume, 10)
SSD$AMAT_avg_vol_2W <- running_average(SSD$AMAT_Volume, 10)
SSD$TXN_avg_vol_2W <- running_average(SSD$TXN_Volume, 10)

SSD$NVDA_avg_vol_1M <- running_average(SSD$NVDA_Volume, 20)
SSD$TSM_avg_vol_1M <- running_average(SSD$TSM_Volume, 20)
SSD$NXPI_avg_vol_1M <- running_average(SSD$NXPI_Volume, 20)
SSD$QCOM_avg_vol_1M <- running_average(SSD$QCOM_Volume, 20)
SSD$MPWR_avg_vol_1M <- running_average(SSD$MPWR_Volume, 20)
SSD$ON_avg_vol_1M <- running_average(SSD$ON_Volume, 20)
SSD$AMD_avg_vol_1M <- running_average(SSD$AMD_Volume, 20)
SSD$INTC_avg_vol_1M <- running_average(SSD$INTC_Volume, 20)
SSD$AVGO_avg_vol_1M <- running_average(SSD$AVGO_Volume, 20)
SSD$ASML_avg_vol_1M <- running_average(SSD$ASML_Volume, 20)
SSD$AMAT_avg_vol_1M <- running_average(SSD$AMAT_Volume, 20)
SSD$TXN_avg_vol_1M <- running_average(SSD$TXN_Volume, 20)

SSD$NVDA_avg_vol_2M <- running_average(SSD$NVDA_Volume, 40)
SSD$TSM_avg_vol_2M <- running_average(SSD$TSM_Volume, 40)
SSD$NXPI_avg_vol_2M <- running_average(SSD$NXPI_Volume, 40)
SSD$QCOM_avg_vol_2M <- running_average(SSD$QCOM_Volume, 40)
SSD$MPWR_avg_vol_2M <- running_average(SSD$MPWR_Volume, 40)
SSD$ON_avg_vol_2M <- running_average(SSD$ON_Volume, 40)
SSD$AMD_avg_vol_2M <- running_average(SSD$AMD_Volume, 40)
SSD$INTC_avg_vol_2M <- running_average(SSD$INTC_Volume, 40)
SSD$AVGO_avg_vol_2M <- running_average(SSD$AVGO_Volume, 40)
SSD$ASML_avg_vol_2M <- running_average(SSD$ASML_Volume, 40)
SSD$AMAT_avg_vol_2M <- running_average(SSD$AMAT_Volume, 40)
SSD$TXN_avg_vol_2M <- running_average(SSD$TXN_Volume, 40)


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_avg_vol_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_avg_vol_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_avg_vol_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_avg_vol_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Average Volume") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_avg_vol_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_avg_vol_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_avg_vol_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_avg_vol_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Average Volume") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_avg_vol_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_avg_vol_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_avg_vol_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_avg_vol_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Average Volume") +
  theme_dark()

n-Day Volume Standard Deviation

SSD$NVDA_std_dev_vol_1W <- running_deviation(SSD$NVDA_Volume, 5)
SSD$TSM_std_dev_vol_1W <- running_deviation(SSD$TSM_Volume, 5)
SSD$NXPI_std_dev_vol_1W <- running_deviation(SSD$NXPI_Volume, 5)
SSD$QCOM_std_dev_vol_1W <- running_deviation(SSD$QCOM_Volume, 5)
SSD$MPWR_std_dev_vol_1W <- running_deviation(SSD$MPWR_Volume, 5)
SSD$ON_std_dev_vol_1W <- running_deviation(SSD$ON_Volume, 5)
SSD$AMD_std_dev_vol_1W <- running_deviation(SSD$AMD_Volume, 5)
SSD$INTC_std_dev_vol_1W <- running_deviation(SSD$INTC_Volume, 5)
SSD$AVGO_std_dev_vol_1W <- running_deviation(SSD$AVGO_Volume, 5)
SSD$ASML_std_dev_vol_1W <- running_deviation(SSD$ASML_Volume, 5)
SSD$AMAT_std_dev_vol_1W <- running_deviation(SSD$AMAT_Volume, 5)
SSD$TXN_std_dev_vol_1W <- running_deviation(SSD$TXN_Volume, 5)

SSD$NVDA_std_dev_vol_2W <- running_deviation(SSD$NVDA_Volume, 10)
SSD$TSM_std_dev_vol_2W <- running_deviation(SSD$TSM_Volume, 10)
SSD$NXPI_std_dev_vol_2W <- running_deviation(SSD$NXPI_Volume, 10)
SSD$QCOM_std_dev_vol_2W <- running_deviation(SSD$QCOM_Volume, 10)
SSD$MPWR_std_dev_vol_2W <- running_deviation(SSD$MPWR_Volume, 10)
SSD$ON_std_dev_vol_2W <- running_deviation(SSD$ON_Volume, 10)
SSD$AMD_std_dev_vol_2W <- running_deviation(SSD$AMD_Volume, 10)
SSD$INTC_std_dev_vol_2W <- running_deviation(SSD$INTC_Volume, 10)
SSD$AVGO_std_dev_vol_2W <- running_deviation(SSD$AVGO_Volume, 10)
SSD$ASML_std_dev_vol_2W <- running_deviation(SSD$ASML_Volume, 10)
SSD$AMAT_std_dev_vol_2W <- running_deviation(SSD$AMAT_Volume, 10)
SSD$TXN_std_dev_vol_2W <- running_deviation(SSD$TXN_Volume, 10)

SSD$NVDA_std_dev_vol_1M <- running_deviation(SSD$NVDA_Volume, 20)
SSD$TSM_std_dev_vol_1M <- running_deviation(SSD$TSM_Volume, 20)
SSD$NXPI_std_dev_vol_1M <- running_deviation(SSD$NXPI_Volume, 20)
SSD$QCOM_std_dev_vol_1M <- running_deviation(SSD$QCOM_Volume, 20)
SSD$MPWR_std_dev_vol_1M <- running_deviation(SSD$MPWR_Volume, 20)
SSD$ON_std_dev_vol_1M <- running_deviation(SSD$ON_Volume, 20)
SSD$AMD_std_dev_vol_1M <- running_deviation(SSD$AMD_Volume, 20)
SSD$INTC_std_dev_vol_1M <- running_deviation(SSD$INTC_Volume, 20)
SSD$AVGO_std_dev_vol_1M <- running_deviation(SSD$AVGO_Volume, 20)
SSD$ASML_std_dev_vol_1M <- running_deviation(SSD$ASML_Volume, 20)
SSD$AMAT_std_dev_vol_1M <- running_deviation(SSD$AMAT_Volume, 20)
SSD$TXN_std_dev_vol_1M <- running_deviation(SSD$TXN_Volume, 20)

SSD$NVDA_std_dev_vol_2M <- running_deviation(SSD$NVDA_Volume, 40)
SSD$TSM_std_dev_vol_2M <- running_deviation(SSD$TSM_Volume, 40)
SSD$NXPI_std_dev_vol_2M <- running_deviation(SSD$NXPI_Volume, 40)
SSD$QCOM_std_dev_vol_2M <- running_deviation(SSD$QCOM_Volume, 40)
SSD$MPWR_std_dev_vol_2M <- running_deviation(SSD$MPWR_Volume, 40)
SSD$ON_std_dev_vol_2M <- running_deviation(SSD$ON_Volume, 40)
SSD$AMD_std_dev_vol_2M <- running_deviation(SSD$AMD_Volume, 40)
SSD$INTC_std_dev_vol_2M <- running_deviation(SSD$INTC_Volume, 40)
SSD$AVGO_std_dev_vol_2M <- running_deviation(SSD$AVGO_Volume, 40)
SSD$ASML_std_dev_vol_2M <- running_deviation(SSD$ASML_Volume, 40)
SSD$AMAT_std_dev_vol_2M <- running_deviation(SSD$AMAT_Volume, 40)
SSD$TXN_std_dev_vol_2M <- running_deviation(SSD$TXN_Volume, 40)


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = AMD_std_dev_vol_2W, color = 'AMD')) +
  geom_line(aes(y = NXPI_std_dev_vol_2W, color = 'NXPI')) + 
  geom_line(aes(y = TXN_std_dev_vol_2W, color = 'TXN')) +
  geom_line(aes(y = AMAT_std_dev_vol_2W, color = 'AMAT')) +
  scale_color_manual(values = c(
    'AMD' = 'green',
    'AMAT' = 'white',
    'NXPI' = 'pink',
    'TXN' = 'lightblue')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Volume Standard Deviation") +
  theme_dark()


ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = NVDA_std_dev_vol_2W, color = 'NVDA')) +
  geom_line(aes(y = MPWR_std_dev_vol_2W, color = 'MPWR')) + 
  geom_line(aes(y = AVGO_std_dev_vol_2W, color = 'AVGO')) +
  geom_line(aes(y = ASML_std_dev_vol_2W, color = 'ASML')) +
  scale_color_manual(values = c(
    'NVDA' = 'darkolivegreen1',
    'MPWR' = 'moccasin',
    'AVGO' = 'coral',
    'ASML' = 'gold')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Volume Standard Deviation") +
  theme_dark()

ggplot(data = SSD, aes(x=Date)) +
  geom_line(aes(y = ON_std_dev_vol_2W, color = 'ON')) +
  geom_line(aes(y = QCOM_std_dev_vol_2W, color = 'QCOM')) + 
  geom_line(aes(y = INTC_std_dev_vol_2W, color = 'INTC')) +
  geom_line(aes(y = TSM_std_dev_vol_2W, color = 'TSM')) +
  scale_color_manual(values = c(
    'ON' = 'cyan',
    'QCOM' = 'purple',
    'INTC' = 'yellow',
    'TSM' = 'red')) +
  ylab('Shares') + 
  scale_y_continuous( labels = label_comma()) +
  ggtitle("2-Week Volume Standard Deviation") +
  theme_dark()

Data Correlation

While having a large array of predictors is useful for seeing the whole picture of the semiconductor market for the 2023-2024 fiscal year, it also brings a potentially significant amount of redundant information. As mentioned earlier, many of our initial predictors from the CSV files are very closely related to one another: the closing price one day is directly tied to the opening price of the following day, and if a stock’s minimum / Low value is increasing, that generally means all 4 other predictors (aside from volume) are increasing as well. In addition, the performance of any two stocks is generally going to be heavily correlated, because both follow the climate of the underlying market.

Ultimately, to achieve a good understanding of the correlations between all of our predictors, we will need to cross-examine several subsets of them to see which predictors are correlated for a single stock and which are useful for measuring competition between stocks. Dividing our correlation plots into two types, we first examine how the predictors are correlated for a fixed stock, and test this underlying trend across a subset of our stocks (ASML, INTC, NVDA, and NXPI):

select(SSD, starts_with("INTC")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="INTC Correlation Plot")

select(SSD, starts_with("NXPI")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="NXPI Correlation Plot")

select(SSD, starts_with("NVDA")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="NVDA Correlation Plot")

select(SSD, starts_with("ASML")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="ASML Correlation Plot")

From the analysis above, we can immediately conclude that five of our six original predictors from the CSV file (everything except volume) are very closely correlated. Additionally, since the running averages are defined in terms of the closing price for each stock individually, it is no surprise that for longer time intervals the running average is closely correlated to the closing price and thus to the remaining original predictors. However, one of the more surprising correlations is the mildly positive relationship between the volume of a stock and its normalized volatility. One might have initially expected that a stock that behaves more unpredictably would be traded less, though the correlation plot indicates otherwise. Lastly, we can see from each of the added predictors based on a time-window hyper-parameter that larger values of the time window (i.e. two months) give little to no new insight into the behavior of a stock’s value.
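
One way to turn the visual impression from the correlation plots into an explicit list is to threshold the correlation matrix directly. The following is a minimal sketch using base R only; the helper name `high_corr_pairs`, the 0.9 threshold, and the toy columns are assumptions made for illustration and are not part of the analysis above.

```r
# Toy data with made-up values: X_High is X_Close shifted by a constant,
# so the two are perfectly correlated, while X_Volume is unrelated.
toy <- data.frame(
  X_Close  = c(100, 101, 103, 102, 105, 107, 106, 108),
  X_High   = c(101, 102, 104, 103, 106, 108, 107, 109),
  X_Volume = c(5, 3, 6, 2, 7, 1, 8, 4) * 1e5
)

# Hypothetical helper: list each pair of columns whose absolute correlation
# exceeds the threshold, looking only at the upper triangle so that every
# pair appears once and the diagonal is skipped.
high_corr_pairs <- function(cor_mat, threshold = 0.9) {
  idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[idx[, "row"]],
    var2 = colnames(cor_mat)[idx[, "col"]],
    r    = cor_mat[idx],
    stringsAsFactors = FALSE
  )
}

redundant <- high_corr_pairs(cor(toy), threshold = 0.9)
redundant
```

Applied to the columns of a single stock, a table like this would make it easier to decide which of a group of near-duplicate predictors to keep.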

With the predictors for a fixed stock thoroughly analyzed, the next important subset of predictors to cross-examine is when the predictor type is fixed and the stock itself is allowed to vary. As we saw from the correlation plots above, several of our predictors for a fixed stock are closely related to one another — thus, there isn’t any reason to examine all 13 predictors across our different manufacturers. Instead, we focus on a subset that has minimal pairwise-correlation: Volume, 2-Week Average (Closing) Price, 2-Week Standard Deviation, and 2-Week Normalized Volatility.

select(SSD, ends_with("avg_vol_2W")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Average Volumes")

select(SSD, ends_with("avg_cl_2W")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Averages")

select(SSD, ends_with("std_dev_cl_2W")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Standard Deviations")

select(SSD, ends_with("avg_ret_2W")) %>%
  cor() %>%
  corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Average Returns")

There are a few key takeaways from this correlation analysis. First, there is no significant relationship in the volume of stock traded between any two companies (besides possibly TXN and NXPI). Second, the fact that most manufacturers’ closing stock prices are heavily correlated suggests that they are affected much more by overall market trends than by competitors’ actions; the one exception to this trend is ON Semiconductor Corporation. Lastly,

Setting Up Models

With a better picture in mind of how our stock prices can be measured from both the given metrics and how they interact with one another, we can now set up our data and begin training our models. This will be done in several steps, first preparing the data to ensure that our models do not become over-fitted to a particular data-set.

Data Split

One of the primary ways we ensure robustness of our models is by partitioning our data into training and testing data. Holding out a portion of the data that is unseen during the training phase (i.e. the testing data) lets us check that a model has not become overfit to the details and noise of the data it was trained on. Ideally, the outcome variable should have similar statistics / variance across both the training and testing sets; this is accomplished by stratifying our split on the desired outcome variable.

SSD_split_1W <- initial_split(SSD, prop = 0.7,
                                strata = NVDA_avg_cl_1W)
SSD_train_1W <- training(SSD_split_1W)
SSD_test_1W <- testing(SSD_split_1W)


SSD_split_2W <- initial_split(SSD, prop = 0.7,
                                strata = NVDA_avg_cl_2W)
SSD_train_2W <- training(SSD_split_2W)
SSD_test_2W <- testing(SSD_split_2W)



SSD_split_1M <- initial_split(SSD, prop = 0.7,
                                strata = NVDA_avg_cl_1M)
SSD_train_1M <- training(SSD_split_1M)
SSD_test_1M <- testing(SSD_split_1M)



SSD_split_2M <- initial_split(SSD, prop = 0.7,
                                strata = NVDA_avg_cl_2M)
SSD_train_2M <- training(SSD_split_2M)
SSD_test_2M <- testing(SSD_split_2M)
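
To illustrate what stratifying on a numeric outcome does, here is a base R sketch of the general idea, assuming the outcome is binned into quartiles before sampling (our understanding of the default behavior for numeric strata in rsample). It is not the implementation used by `initial_split`, and the `outcome` vector is made up.

```r
set.seed(3435)

# Hypothetical numeric outcome standing in for a column such as NVDA_avg_cl_1W
outcome <- seq(1, 100)

# Bin the outcome into quartiles
bins <- cut(outcome,
            breaks = quantile(outcome, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Sample 70% of the rows within each bin, so that every quartile of the
# outcome is represented in the training set in the same proportion
train_idx <- unlist(lapply(split(seq_along(outcome), bins), function(rows) {
  rows[sample.int(length(rows), size = floor(0.7 * length(rows)))]
}))
test_idx <- setdiff(seq_along(outcome), train_idx)

# Number of training rows drawn from each quartile
table(bins[train_idx])
```

Because each quartile contributes the same share of rows, the training and testing sets end up covering a similar range of the outcome.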

One-Week Model Fitting

 SSD_recipe_1W = recipe(
   NVDA_avg_cl_1W ~ NVDA_std_dev_cl_1W + NVDA_avg_ret_1W + NVDA_std_dev_ret_1W + NVDA_avg_vol_1W + NVDA_std_dev_vol_1W +
     TSM_avg_cl_1W + TSM_std_dev_cl_1W + TSM_avg_ret_1W + TSM_std_dev_ret_1W + TSM_avg_vol_1W + TSM_std_dev_vol_1W +
     NXPI_avg_cl_1W + NXPI_std_dev_cl_1W + NXPI_avg_ret_1W + NXPI_std_dev_ret_1W + NXPI_avg_vol_1W + NXPI_std_dev_vol_1W + QCOM_avg_cl_1W + QCOM_std_dev_cl_1W + QCOM_avg_ret_1W + QCOM_std_dev_ret_1W + QCOM_avg_vol_1W + QCOM_std_dev_vol_1W + MPWR_avg_cl_1W + MPWR_std_dev_cl_1W + MPWR_avg_ret_1W + MPWR_std_dev_ret_1W + MPWR_avg_vol_1W + MPWR_std_dev_vol_1W + ON_avg_cl_1W + ON_std_dev_cl_1W + ON_avg_ret_1W + ON_std_dev_ret_1W + ON_avg_vol_1W + ON_std_dev_vol_1W + AMD_avg_cl_1W + AMD_std_dev_cl_1W + AMD_avg_ret_1W + AMD_std_dev_ret_1W + AMD_avg_vol_1W + AMD_std_dev_vol_1W + INTC_avg_cl_1W + INTC_std_dev_cl_1W + INTC_avg_ret_1W + INTC_std_dev_ret_1W + INTC_avg_vol_1W + INTC_std_dev_vol_1W + AVGO_avg_cl_1W + AVGO_std_dev_cl_1W + AVGO_avg_ret_1W + AVGO_std_dev_ret_1W + AVGO_avg_vol_1W + AVGO_std_dev_vol_1W + ASML_avg_cl_1W + ASML_std_dev_cl_1W + ASML_avg_ret_1W + ASML_std_dev_ret_1W + ASML_avg_vol_1W + ASML_std_dev_vol_1W + AMAT_avg_cl_1W + AMAT_std_dev_cl_1W + AMAT_avg_ret_1W + AMAT_std_dev_ret_1W + AMAT_avg_vol_1W + AMAT_std_dev_vol_1W + TXN_avg_cl_1W + TXN_std_dev_cl_1W + TXN_avg_ret_1W + TXN_std_dev_ret_1W + TXN_avg_vol_1W + TXN_std_dev_vol_1W,
                     data=SSD_train_1W) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

 SSD_recipe_2W = recipe(
   NVDA_avg_cl_2W ~ NVDA_std_dev_cl_2W + NVDA_avg_ret_2W + NVDA_std_dev_ret_2W + NVDA_avg_vol_2W + NVDA_std_dev_vol_2W +
     TSM_avg_cl_2W + TSM_std_dev_cl_2W + TSM_avg_ret_2W + TSM_std_dev_ret_2W + TSM_avg_vol_2W + TSM_std_dev_vol_2W +
     NXPI_avg_cl_2W + NXPI_std_dev_cl_2W + NXPI_avg_ret_2W + NXPI_std_dev_ret_2W + NXPI_avg_vol_2W + NXPI_std_dev_vol_2W + QCOM_avg_cl_2W + QCOM_std_dev_cl_2W + QCOM_avg_ret_2W + QCOM_std_dev_ret_2W + QCOM_avg_vol_2W + QCOM_std_dev_vol_2W + MPWR_avg_cl_2W + MPWR_std_dev_cl_2W + MPWR_avg_ret_2W + MPWR_std_dev_ret_2W + MPWR_avg_vol_2W + MPWR_std_dev_vol_2W + ON_avg_cl_2W + ON_std_dev_cl_2W + ON_avg_ret_2W + ON_std_dev_ret_2W + ON_avg_vol_2W + ON_std_dev_vol_2W + AMD_avg_cl_2W + AMD_std_dev_cl_2W + AMD_avg_ret_2W + AMD_std_dev_ret_2W + AMD_avg_vol_2W + AMD_std_dev_vol_2W + INTC_avg_cl_2W + INTC_std_dev_cl_2W + INTC_avg_ret_2W + INTC_std_dev_ret_2W + INTC_avg_vol_2W + INTC_std_dev_vol_2W + AVGO_avg_cl_2W + AVGO_std_dev_cl_2W + AVGO_avg_ret_2W + AVGO_std_dev_ret_2W + AVGO_avg_vol_2W + AVGO_std_dev_vol_2W + ASML_avg_cl_2W + ASML_std_dev_cl_2W + ASML_avg_ret_2W + ASML_std_dev_ret_2W + ASML_avg_vol_2W + ASML_std_dev_vol_2W + AMAT_avg_cl_2W + AMAT_std_dev_cl_2W + AMAT_avg_ret_2W + AMAT_std_dev_ret_2W + AMAT_avg_vol_2W + AMAT_std_dev_vol_2W + TXN_avg_cl_2W + TXN_std_dev_cl_2W + TXN_avg_ret_2W + TXN_std_dev_ret_2W + TXN_avg_vol_2W + TXN_std_dev_vol_2W,
                     data=SSD_train_2W) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())

  SSD_recipe_1M = recipe(
   NVDA_avg_cl_1M ~ NVDA_std_dev_cl_1M + NVDA_avg_ret_1M + NVDA_std_dev_ret_1M + NVDA_avg_vol_1M + NVDA_std_dev_vol_1M +
     TSM_avg_cl_1M + TSM_std_dev_cl_1M + TSM_avg_ret_1M + TSM_std_dev_ret_1M + TSM_avg_vol_1M + TSM_std_dev_vol_1M +
     NXPI_avg_cl_1M + NXPI_std_dev_cl_1M + NXPI_avg_ret_1M + NXPI_std_dev_ret_1M + NXPI_avg_vol_1M + NXPI_std_dev_vol_1M + QCOM_avg_cl_1M + QCOM_std_dev_cl_1M + QCOM_avg_ret_1M + QCOM_std_dev_ret_1M + QCOM_avg_vol_1M + QCOM_std_dev_vol_1M + MPWR_avg_cl_1M + MPWR_std_dev_cl_1M + MPWR_avg_ret_1M + MPWR_std_dev_ret_1M + MPWR_avg_vol_1M + MPWR_std_dev_vol_1M + ON_avg_cl_1M + ON_std_dev_cl_1M + ON_avg_ret_1M + ON_std_dev_ret_1M + ON_avg_vol_1M + ON_std_dev_vol_1M + AMD_avg_cl_1M + AMD_std_dev_cl_1M + AMD_avg_ret_1M + AMD_std_dev_ret_1M + AMD_avg_vol_1M + AMD_std_dev_vol_1M + INTC_avg_cl_1M + INTC_std_dev_cl_1M + INTC_avg_ret_1M + INTC_std_dev_ret_1M + INTC_avg_vol_1M + INTC_std_dev_vol_1M + AVGO_avg_cl_1M + AVGO_std_dev_cl_1M + AVGO_avg_ret_1M + AVGO_std_dev_ret_1M + AVGO_avg_vol_1M + AVGO_std_dev_vol_1M + ASML_avg_cl_1M + ASML_std_dev_cl_1M + ASML_avg_ret_1M + ASML_std_dev_ret_1M + ASML_avg_vol_1M + ASML_std_dev_vol_1M + AMAT_avg_cl_1M + AMAT_std_dev_cl_1M + AMAT_avg_ret_1M + AMAT_std_dev_ret_1M + AMAT_avg_vol_1M + AMAT_std_dev_vol_1M + TXN_avg_cl_1M + TXN_std_dev_cl_1M + TXN_avg_ret_1M + TXN_std_dev_ret_1M + TXN_avg_vol_1M + TXN_std_dev_vol_1M,
                     data=SSD_train_1M) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
  
SSD_recipe_2M = recipe(
   NVDA_avg_cl_2M ~ NVDA_std_dev_cl_2M + NVDA_avg_ret_2M + NVDA_std_dev_ret_2M + NVDA_avg_vol_2M + NVDA_std_dev_vol_2M +
     TSM_avg_cl_2M + TSM_std_dev_cl_2M + TSM_avg_ret_2M + TSM_std_dev_ret_2M + TSM_avg_vol_2M + TSM_std_dev_vol_2M +
     NXPI_avg_cl_2M + NXPI_std_dev_cl_2M + NXPI_avg_ret_2M + NXPI_std_dev_ret_2M + NXPI_avg_vol_2M + NXPI_std_dev_vol_2M + QCOM_avg_cl_2M + QCOM_std_dev_cl_2M + QCOM_avg_ret_2M + QCOM_std_dev_ret_2M + QCOM_avg_vol_2M + QCOM_std_dev_vol_2M + MPWR_avg_cl_2M + MPWR_std_dev_cl_2M + MPWR_avg_ret_2M + MPWR_std_dev_ret_2M + MPWR_avg_vol_2M + MPWR_std_dev_vol_2M + ON_avg_cl_2M + ON_std_dev_cl_2M + ON_avg_ret_2M + ON_std_dev_ret_2M + ON_avg_vol_2M + ON_std_dev_vol_2M + AMD_avg_cl_2M + AMD_std_dev_cl_2M + AMD_avg_ret_2M + AMD_std_dev_ret_2M + AMD_avg_vol_2M + AMD_std_dev_vol_2M + INTC_avg_cl_2M + INTC_std_dev_cl_2M + INTC_avg_ret_2M + INTC_std_dev_ret_2M + INTC_avg_vol_2M + INTC_std_dev_vol_2M + AVGO_avg_cl_2M + AVGO_std_dev_cl_2M + AVGO_avg_ret_2M + AVGO_std_dev_ret_2M + AVGO_avg_vol_2M + AVGO_std_dev_vol_2M + ASML_avg_cl_2M + ASML_std_dev_cl_2M + ASML_avg_ret_2M + ASML_std_dev_ret_2M + ASML_avg_vol_2M + ASML_std_dev_vol_2M + AMAT_avg_cl_2M + AMAT_std_dev_cl_2M + AMAT_avg_ret_2M + AMAT_std_dev_ret_2M + AMAT_avg_vol_2M + AMAT_std_dev_vol_2M + TXN_avg_cl_2M + TXN_std_dev_cl_2M + TXN_avg_ret_2M + TXN_std_dev_ret_2M + TXN_avg_vol_2M + TXN_std_dev_vol_2M,
                     data=SSD_train_2M) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors())
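
The `step_center()` and `step_scale()` steps in each recipe standardize every predictor using statistics estimated from the training data, which matters here because penalized regression and k-Nearest Neighbors are both sensitive to the scale of the predictors. The following is a minimal base-R sketch of that idea on made-up data; the `toy_train` and `toy_new` data frames and their column names are hypothetical, and this is not how the recipes package implements the steps.

```r
# Toy data standing in for a training set and two new observations.
set.seed(1)
toy_train <- data.frame(avg_cl  = rnorm(50, mean = 400, sd = 60),
                        avg_vol = rnorm(50, mean = 5e7, sd = 1e7))
toy_new   <- data.frame(avg_cl = c(380, 450), avg_vol = c(4e7, 6e7))

# Learn the centering and scaling constants from the training data only.
train_means <- colMeans(toy_train)
train_sds   <- apply(toy_train, 2, sd)

# Apply the same training-set constants to any data set.
standardize <- function(df) {
  as.data.frame(scale(df, center = train_means, scale = train_sds))
}
train_std <- standardize(toy_train)
new_std   <- standardize(toy_new)

round(colMeans(train_std), 10)   # approximately 0 for each column
apply(train_std, 2, sd)          # 1 for each column
```

New observations are transformed with the training means and standard deviations rather than their own, so their standardized values need not have mean 0.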

k-Fold Cross Validation

SSD_folds_1W <- vfold_cv(SSD_train_1W, v = 10, strata = NVDA_avg_cl_1W)
SSD_folds_2W <- vfold_cv(SSD_train_2W, v = 10, strata = NVDA_avg_cl_2W)
SSD_folds_1M <- vfold_cv(SSD_train_1M, v = 10, strata = NVDA_avg_cl_1M)
SSD_folds_2M <- vfold_cv(SSD_train_2M, v = 10, strata = NVDA_avg_cl_2M)
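
When `strata` is a numeric column, `vfold_cv()` bins it (into quartiles by default, per the rsample documentation) and samples within each bin, so that every fold covers the low, middle, and high ranges of the outcome. The base-R sketch below illustrates the idea on simulated values; the variable names are hypothetical and the exact binning used by rsample may differ.

```r
set.seed(1)
y <- rnorm(200, mean = 450, sd = 100)   # hypothetical outcome values
v <- 10                                 # number of folds

# Bin the numeric outcome into quartiles.
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)), include.lowest = TRUE)

# Deal fold labels out at random within each bin.
fold_id <- integer(length(y))
for (b in levels(bins)) {
  idx <- which(bins == b)
  fold_id[idx] <- sample(rep_len(seq_len(v), length(idx)))
}

table(fold_id)         # fold sizes
table(bins, fold_id)   # every fold draws from every bin
```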

Fitting the Models

# Linear Regression
lm_model <- linear_reg() %>% 
  set_engine("lm")


# Ridge Regression
ridge_model <- linear_reg(mixture = 0, 
                         penalty = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

# Lasso Regression
lasso_model <- linear_reg(mixture = 1, 
                         penalty = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")


# Elastic Net
elastic_net_model <- linear_reg(mixture = tune(), 
                              penalty = tune()) %>% 
  set_mode("regression") %>%
  set_engine("glmnet")

# k-Nearest Neighbors (number of neighbors is tuned below)
knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("regression")
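
For reference, the three penalized models differ only in their penalty term. In the parameterization documented for glmnet, with `penalty` playing the role of \lambda and `mixture` the role of \alpha, the objective being minimized can be written as:

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
  + \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

Setting \alpha = 0 leaves only the squared (ridge) penalty, \alpha = 1 leaves only the absolute-value (lasso) penalty, and intermediate values give the elastic net, which is why `mixture` is fixed at 0 and 1 for the first two models and tuned for the third.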

Set Up Workflows

# Linear Regression Workflows 
lm_wflow_1W <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(SSD_recipe_1W)
lm_wflow_2W <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(SSD_recipe_2W)
lm_wflow_1M <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(SSD_recipe_1M)
lm_wflow_2M <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(SSD_recipe_2M)

# Ridge Regression Workflows
ridge_wflow_1W <- workflow() %>% 
  add_model(ridge_model) %>% 
  add_recipe(SSD_recipe_1W)
ridge_wflow_2W <- workflow() %>% 
  add_model(ridge_model) %>% 
  add_recipe(SSD_recipe_2W)
ridge_wflow_1M <- workflow() %>% 
  add_model(ridge_model) %>% 
  add_recipe(SSD_recipe_1M)
ridge_wflow_2M <- workflow() %>% 
  add_model(ridge_model) %>% 
  add_recipe(SSD_recipe_2M)

# Lasso Regression Workflows
lasso_wflow_1W <- workflow() %>% 
  add_model(lasso_model) %>% 
  add_recipe(SSD_recipe_1W)
lasso_wflow_2W <- workflow() %>% 
  add_model(lasso_model) %>% 
  add_recipe(SSD_recipe_2W)
lasso_wflow_1M <- workflow() %>% 
  add_model(lasso_model) %>% 
  add_recipe(SSD_recipe_1M)
lasso_wflow_2M <- workflow() %>% 
  add_model(lasso_model) %>% 
  add_recipe(SSD_recipe_2M)

# Elastic Net Workflows
elastic_net_wflow_1W <- workflow() %>% 
  add_model(elastic_net_model) %>% 
  add_recipe(SSD_recipe_1W)
elastic_net_wflow_2W <- workflow() %>% 
  add_model(elastic_net_model) %>% 
  add_recipe(SSD_recipe_2W)
elastic_net_wflow_1M <- workflow() %>% 
  add_model(elastic_net_model) %>% 
  add_recipe(SSD_recipe_1M)
elastic_net_wflow_2M <- workflow() %>% 
  add_model(elastic_net_model) %>% 
  add_recipe(SSD_recipe_2M)

# k-Nearest Neighbors Workflows
knn_wflow_1W <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(SSD_recipe_1W)
knn_wflow_2W <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(SSD_recipe_2W)
knn_wflow_1M <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(SSD_recipe_1M)
knn_wflow_2M <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(SSD_recipe_2M)

Hyperparameter Tuning

Set up Grids:

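# Grid for Ridge Regression and Lasso Regression
# Note: penalty() is on a log10 scale by default, so range = c(0, 1) covers penalties from 10^0 = 1 to 10^1 = 10
no_mixture_grid <- grid_regular(penalty(range = c(0,1)), levels = 50)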

# Grid for Elastic Net
elastic_net_grid <- grid_regular(penalty(range = c(0, 1),
                                     trans = identity_trans()),
                        mixture(range = c(0, 1)),
                             levels = 10)

# Grid for k-Nearest Neighbors
knn_grid <- grid_regular(neighbors(range = c(2,20)), levels = 10)
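
To make the three grids concrete, the base-R sketch below lists the candidate values each one should contain. It assumes, following the dials documentation, that `penalty()` is defined on a log10 scale unless `trans = identity_trans()` is supplied; it does not call dials itself, so the object names here are illustrative only.

```r
# Ridge/lasso grid: range = c(0, 1) read on the log10 scale, 50 levels.
no_mixture_penalties <- 10^seq(0, 1, length.out = 50)
range(no_mixture_penalties)      # 1 to 10

# Elastic net grid: identity transformation, so penalties run from 0 to 1,
# crossed with 10 mixture values.
elastic_penalties <- seq(0, 1, length.out = 10)
elastic_mixtures  <- seq(0, 1, length.out = 10)
elastic_combos    <- expand.grid(penalty = elastic_penalties, mixture = elastic_mixtures)
nrow(elastic_combos)             # 100 combinations

# k-NN grid: 10 evenly spaced integer values between 2 and 20.
knn_neighbors <- unique(round(seq(2, 20, length.out = 10)))
knn_neighbors                    # 2 4 6 8 10 12 14 16 18 20
```

Under that assumption the ridge and lasso models are tuned over penalties between 1 and 10, while the elastic net is tuned over penalties between 0 and 1, so the two searches do not cover the same range.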

Tune Parameters for 1-Week Recipe

# Find optimal parameters for ridge regression
ridge_tune_1W <- tune_grid(
  ridge_wflow_1W,
  resamples = SSD_folds_1W,
  grid = no_mixture_grid
)
ridge_final_wflow_1W <- select_best(ridge_tune_1W, metric="rmse" ) %>%
  finalize_workflow(x=ridge_wflow_1W)

# Find optimal parameters for lasso regression
lasso_tune_1W <- tune_grid(
  lasso_wflow_1W,
  resamples = SSD_folds_1W,
  grid = no_mixture_grid
)
lasso_final_wflow_1W <- select_best(lasso_tune_1W, metric="rmse") %>%
  finalize_workflow(x=lasso_wflow_1W)

# Find optimal parameters for Elastic Net
elastic_net_tune_1W <- tune_grid(
  elastic_net_wflow_1W,
  resamples = SSD_folds_1W,
  grid = elastic_net_grid
)
elastic_net_final_wflow_1W <- select_best(elastic_net_tune_1W, metric = "rmse") %>% 
  finalize_workflow(x=elastic_net_wflow_1W)

# Find optimal parameters for k-Nearest Neighbors
knn_tune_1W <- tune_grid(
    knn_wflow_1W,
    resamples = SSD_folds_1W,
    grid = knn_grid
)
knn_final_wflow_1W <- select_best(knn_tune_1W, metric = "rmse") %>%
  finalize_workflow(x=knn_wflow_1W)
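
Conceptually, `tune_grid()` followed by `select_best()` evaluates each candidate value on every fold and keeps the one with the lowest average RMSE. The following is a rough base-R sketch of that loop for a ridge penalty on simulated data; the closed-form `ridge_fit()` helper and all object names are illustrative assumptions, not what glmnet or tune do internally.

```r
set.seed(1)
n <- 120; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # standardized toy predictors
y <- drop(X %*% c(3, -2, 0, 0, 1)) + rnorm(n)   # toy outcome
fold_id <- sample(rep_len(1:10, n))             # 10 random folds
lambdas <- 10^seq(-3, 1, length.out = 20)       # candidate penalties

# Closed-form ridge coefficients on centered data; the intercept is recovered afterwards.
ridge_fit <- function(X, y, lambda) {
  x_means <- colMeans(X); y_mean <- mean(y)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * nrow(X) * diag(ncol(X)),
                crossprod(Xc, y - y_mean))
  list(beta = drop(beta), intercept = y_mean - sum(x_means * beta))
}

# Average held-out RMSE across the folds for each candidate penalty.
cv_rmse <- sapply(lambdas, function(lambda) {
  mean(sapply(1:10, function(k) {
    held_out <- fold_id == k
    fit  <- ridge_fit(X[!held_out, , drop = FALSE], y[!held_out], lambda)
    pred <- fit$intercept + drop(X[held_out, , drop = FALSE] %*% fit$beta)
    sqrt(mean((y[held_out] - pred)^2))
  }))
})

best_lambda <- lambdas[which.min(cv_rmse)]   # analogue of select_best(metric = "rmse")
best_lambda
```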

Tune Parameters for 2-Week Recipe

# Find optimal parameters for ridge regression
ridge_tune_2W <- tune_grid(
  ridge_wflow_2W,
  resamples = SSD_folds_2W,
  grid = no_mixture_grid
)
ridge_final_wflow_2W <- select_best(ridge_tune_2W, metric="rmse" ) %>%
  finalize_workflow(x=ridge_wflow_2W)

# Find optimal parameters for lasso regression
lasso_tune_2W <- tune_grid(
  lasso_wflow_2W,
  resamples = SSD_folds_2W,
  grid = no_mixture_grid
)
lasso_final_wflow_2W <- select_best(lasso_tune_2W, metric="rmse") %>%
  finalize_workflow(x=lasso_wflow_2W)

# Find optimal parameters for Elastic Net
elastic_net_tune_2W <- tune_grid(
  elastic_net_wflow_2W,
  resamples = SSD_folds_2W,
  grid = elastic_net_grid
)
elastic_net_final_wflow_2W <- select_best(elastic_net_tune_2W,  metric = "rmse") %>% 
  finalize_workflow(x=elastic_net_wflow_2W)

# Find optimal parameters for k-Nearest Neighbors
knn_tune_2W <- tune_grid(
    knn_wflow_2W,
    resamples = SSD_folds_2W,
    grid = knn_grid
)
knn_final_wflow_2W <- select_best(knn_tune_2W, metric = "rmse") %>% 
  finalize_workflow(x=knn_wflow_2W)

Tune Parameters for 1-Month Recipe

# Find optimal parameters for ridge regression
ridge_tune_1M <- tune_grid(
  ridge_wflow_1M,
  resamples = SSD_folds_1M,
  grid = no_mixture_grid
)
ridge_final_wflow_1M <- select_best(ridge_tune_1M, metric="rmse" ) %>% 
  finalize_workflow(x=ridge_wflow_1M)
 
# Find optimal parameters for lasso regression
lasso_tune_1M <- tune_grid(
  lasso_wflow_1M,
  resamples = SSD_folds_1M,
  grid = no_mixture_grid
)
lasso_final_wflow_1M <- select_best(lasso_tune_1M, metric="rmse") %>%
  finalize_workflow(x=lasso_wflow_1M)

# Find optimal parameters for Elastic Net
elastic_net_tune_1M <- tune_grid(
  elastic_net_wflow_1M,
  resamples = SSD_folds_1M,
  grid = elastic_net_grid
)
elastic_net_final_wflow_1M <- select_best(elastic_net_tune_1M, metric = "rmse") %>% 
  finalize_workflow(x=elastic_net_wflow_1M)

# Find optimal parameters for k-Nearest Neighbors
knn_tune_1M <- tune_grid(
    knn_wflow_1M,
    resamples = SSD_folds_1M,
    grid = knn_grid
)
knn_final_wflow_1M <- select_best(knn_tune_1M, metric = "rmse") %>%
  finalize_workflow(x=knn_wflow_1M)

Tune Parameters for 2-Month Recipe

# Find optimal parameters for ridge regression
ridge_tune_2M <- tune_grid(
  ridge_wflow_2M,
  resamples = SSD_folds_2M,
  grid = no_mixture_grid
)
ridge_final_wflow_2M <-  select_best(ridge_tune_2M, metric="rmse" ) %>%
  finalize_workflow(x=ridge_wflow_2M)

# Find optimal parameters for lasso regression
lasso_tune_2M <- tune_grid(
  lasso_wflow_2M,
  resamples = SSD_folds_2M,
  grid = no_mixture_grid
)
lasso_final_wflow_2M <-select_best(lasso_tune_2M, metric="rmse") %>%
  finalize_workflow(x=lasso_wflow_2M)

# Find optimal parameters for Elastic Net
elastic_net_tune_2M <- tune_grid(
  elastic_net_wflow_2M,
  resamples = SSD_folds_2M,
  grid = elastic_net_grid
)
elastic_net_final_wflow_2M <- select_best(elastic_net_tune_2M, metric = "rmse") %>%
  finalize_workflow(x=elastic_net_wflow_2M)


# Find optimal parameters for k-Nearest Neighbors
knn_tune_2M <- tune_grid(
    knn_wflow_2M,
    resamples = SSD_folds_2M,
    grid = knn_grid
)
knn_final_wflow_2M <- select_best(knn_tune_2M, metric = "rmse") %>%
  finalize_workflow(x=knn_wflow_2M)

Model Fitting

# Linear Regression Fits
lm_fit_1W <- fit(lm_wflow_1W, SSD_train_1W)
lm_fit_2W <- fit(lm_wflow_2W, SSD_train_2W)
lm_fit_1M <- fit(lm_wflow_1M, SSD_train_1M)
lm_fit_2M <- fit(lm_wflow_2M, SSD_train_2M)

# Ridge Regression Fits
ridge_fit_1W <- fit(ridge_final_wflow_1W, SSD_train_1W)
ridge_fit_2W <- fit(ridge_final_wflow_2W, SSD_train_2W)
ridge_fit_1M <- fit(ridge_final_wflow_1M, SSD_train_1M)
ridge_fit_2M <- fit(ridge_final_wflow_2M, SSD_train_2M)

# Lasso Regression Fits
lasso_fit_1W <- fit(lasso_final_wflow_1W, SSD_train_1W)
lasso_fit_2W <- fit(lasso_final_wflow_2W, SSD_train_2W)
lasso_fit_1M <- fit(lasso_final_wflow_1M, SSD_train_1M)
lasso_fit_2M <- fit(lasso_final_wflow_2M, SSD_train_2M)

# Elastic Net Fits
elastic_net_fit_1W <- fit(elastic_net_final_wflow_1W, SSD_train_1W)
elastic_net_fit_2W <- fit(elastic_net_final_wflow_2W, SSD_train_2W)
elastic_net_fit_1M <- fit(elastic_net_final_wflow_1M, SSD_train_1M)
elastic_net_fit_2M <- fit(elastic_net_final_wflow_2M, SSD_train_2M)

# k-Nearest Neighbors Fit
knn_fit_1W <- fit(knn_final_wflow_1W, SSD_train_1W)
knn_fit_2W <- fit(knn_final_wflow_2W, SSD_train_2W)
knn_fit_1M <- fit(knn_final_wflow_1M, SSD_train_1M)
knn_fit_2M <- fit(knn_final_wflow_2M, SSD_train_2M)

Model Results for Predicting the Average NVIDIA Stock Price (Training Data)

# Linear Regression Training 
lm_train_res_1W <- predict(lm_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
lm_train_res_1W <- bind_cols(lm_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))

lm_train_res_2W <- predict(lm_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
lm_train_res_2W <- bind_cols(lm_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))

lm_train_res_1M <- predict(lm_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
lm_train_res_1M <- bind_cols(lm_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))

lm_train_res_2M <- predict(lm_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
lm_train_res_2M <- bind_cols(lm_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))


# Ridge Regression Training
ridge_train_res_1W <- predict(ridge_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
ridge_train_res_1W <- bind_cols(ridge_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))

ridge_train_res_2W <- predict(ridge_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
ridge_train_res_2W <- bind_cols(ridge_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))

ridge_train_res_1M <- predict(ridge_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
ridge_train_res_1M <- bind_cols(ridge_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))

ridge_train_res_2M <- predict(ridge_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
ridge_train_res_2M <- bind_cols(ridge_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))


# Lasso Regression Training
lasso_train_res_1W <- predict(lasso_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
lasso_train_res_1W <- bind_cols(lasso_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))

lasso_train_res_2W <- predict(lasso_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
lasso_train_res_2W <- bind_cols(lasso_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))

lasso_train_res_1M <- predict(lasso_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
lasso_train_res_1M <- bind_cols(lasso_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))

lasso_train_res_2M <- predict(lasso_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
lasso_train_res_2M <- bind_cols(lasso_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))

# Elastic Net Training
elastic_net_train_res_1W <- predict(elastic_net_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
elastic_net_train_res_1W <- bind_cols(elastic_net_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))

elastic_net_train_res_2W <- predict(elastic_net_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
elastic_net_train_res_2W <- bind_cols(elastic_net_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))

elastic_net_train_res_1M <- predict(elastic_net_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
elastic_net_train_res_1M <- bind_cols(elastic_net_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))

elastic_net_train_res_2M <- predict(elastic_net_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
elastic_net_train_res_2M <- bind_cols(elastic_net_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))

# k-Nearest Neighbors Training
knn_train_res_1W <- predict(knn_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
knn_train_res_1W <- bind_cols(knn_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))

knn_train_res_2W <- predict(knn_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
knn_train_res_2W <- bind_cols(knn_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))

knn_train_res_1M <- predict(knn_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
knn_train_res_1M <- bind_cols(knn_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))

knn_train_res_2M <- predict(knn_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
knn_train_res_2M <- bind_cols(knn_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))

Model Accuracies

Root Mean Square Error (RMSE) results on the training data:

tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors"),
       One_Week = c((lm_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
                    (ridge_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
                    (lasso_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
                    (elastic_net_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
                    (knn_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate ),
       Two_Week = c((lm_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
                    (ridge_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
                    (lasso_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
                    (elastic_net_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
                    (knn_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate),
       One_Month = c((lm_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
                    (ridge_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
                    (lasso_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
                    (elastic_net_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
                    (knn_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate),
       Two_Month = c((lm_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
                    (ridge_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
                    (lasso_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
                    (elastic_net_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
                    (knn_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate)
       ) %>% 
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
| Model | One_Week | Two_Week | One_Month | Two_Month |
|---|---|---|---|---|
| Linear Regression | 16.836273 | 11.6061295 | 5.869688 | 1.9282477 |
| Ridge Regression | 22.645078 | 17.1559196 | 11.596960 | 7.2480449 |
| Lasso Regression | 21.314497 | 16.8026942 | 12.539594 | 8.8715326 |
| Elastic Net | 16.914634 | 11.8368996 | 6.321052 | 4.7157312 |
| k-Nearest Neighbors | 1.435098 | 0.7352442 | 0.571276 | 0.4497364 |
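
The very low k-Nearest Neighbors values in this table should be read with one caveat in mind: the predictions were made on the same training data the models were fit to, and each training point is among its own nearest neighbors, so a k-NN model can reproduce the training data very closely regardless of how it performs on new data. A small base-R illustration on simulated data follows; the unweighted `knn_predict()` helper is hypothetical and simpler than the kknn engine used above.

```r
set.seed(1)
x_train <- runif(100, 0, 10); y_train <- sin(x_train) + rnorm(100, sd = 0.3)
x_test  <- runif(100, 0, 10); y_test  <- sin(x_test)  + rnorm(100, sd = 0.3)

# Unweighted k-NN regression in one dimension:
# average the outcomes of the k closest training points.
knn_predict <- function(x_new, x, y, k) {
  sapply(x_new, function(x0) mean(y[order(abs(x - x0))[seq_len(k)]]))
}
rmse_manual <- function(truth, pred) sqrt(mean((truth - pred)^2))

# With k = 1, every training point is its own nearest neighbor.
train_rmse <- rmse_manual(y_train, knn_predict(x_train, x_train, y_train, k = 1))
test_rmse  <- rmse_manual(y_test,  knn_predict(x_test,  x_train, y_train, k = 1))
c(train = train_rmse, test = test_rmse)   # training error is 0; held-out error is not
```

Comparing models on held-out data, or on the cross-validated RMSE already computed during tuning, avoids this optimism.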

R^2 results on the training data:

tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors"),
       One_Week = c((lm_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
                    (ridge_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
                    (lasso_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
                    (elastic_net_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
                    (knn_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate ),
       Two_Week = c((lm_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
                    (ridge_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
                    (lasso_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
                    (elastic_net_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
                    (knn_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate),
       One_Month = c((lm_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
                    (ridge_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
                    (lasso_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
                    (elastic_net_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
                    (knn_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate),
       Two_Month = c((lm_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
                    (ridge_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
                    (lasso_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
                    (elastic_net_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
                    (knn_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate)
       ) %>% 
  kable() %>% 
  kable_styling(full_width = F) %>% 
  scroll_box(width = "100%", height = "200px")
| Model | One_Week | Two_Week | One_Month | Two_Month |
|---|---|---|---|---|
| Linear Regression | 0.9914555 | 0.9959253 | 0.9988330 | 0.9998424 |
| Ridge Regression | 0.9848118 | 0.9913159 | 0.9956269 | 0.9979111 |
| Lasso Regression | 0.9863868 | 0.9915371 | 0.9947582 | 0.9967498 |
| Elastic Net | 0.9913760 | 0.9957622 | 0.9986471 | 0.9990764 |
| k-Nearest Neighbors | 0.9999381 | 0.9999837 | 0.9999890 | 0.9999915 |
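
For readers who want to check the two metrics by hand: RMSE is the square root of the mean squared residual, and yardstick's `rsq()` is, per its documentation, the squared correlation between the observed and predicted values (the `1 - SSE/SST` form is provided separately as `rsq_trad()`). A base-R sketch on made-up numbers, with hypothetical variable names:

```r
truth <- c(410, 425, 440, 460, 455)   # hypothetical observed values
pred  <- c(412, 420, 445, 458, 450)   # hypothetical predictions

# Root mean square error: typical size of a residual, in the units of the outcome.
rmse_by_hand <- sqrt(mean((truth - pred)^2))

# Squared correlation, the quantity reported by rsq().
rsq_by_hand <- cor(truth, pred)^2

# Traditional R^2 (1 - SSE/SST), the quantity reported by rsq_trad().
rsq_trad_by_hand <- 1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)

c(rmse = rmse_by_hand, rsq = rsq_by_hand, rsq_trad = rsq_trad_by_hand)
```

Because the squared correlation ignores any constant offset or rescaling of the predictions, it can stay close to 1 even when RMSE is large, which is one reason to read the two tables together.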